Big Data - Hive Optimization

Optimization is key in any Big Data workload


Hive runs on top of Hadoop and is widely used in industry to run queries that analyze petabytes of data (big data). Performance therefore needs to be considered in any Hive development, since these queries ultimately transform and process huge amounts of data for analytics.

There are several optimization techniques we can use to improve Hive performance.

Hive optimization can broadly be classified into three major categories.

1) Table Design level (Structure level)

2) Query Level 

3) Execution Level

1) Table Design level (Structure level):

    Design-level optimization is what we consider while designing and defining the structure of tables. A few very important concepts, such as partitioning, bucketing, specialized file formats, and various compression techniques, fall under this category.

    Partitioning and bucketing are ways of dividing data into smaller parts so we end up scanning less data. The main idea is selecting the right column for partitioning or bucketing to make it efficient.

We should go for partitioning when a column has a small number of distinct values, like country, state, or city, while bucketing suits columns with many distinct values, like an id column.

We can also use multi-level partitioning, or combine both partitioning and bucketing in a single table.
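The ideas above can be sketched in HiveQL. This is a minimal example; the table and column names (`sales`, `customer_id`, `country`, etc.) and the bucket count are illustrative assumptions, not a prescription:

```sql
-- Partition on low-cardinality 'country', bucket on high-cardinality
-- 'customer_id' (names and bucket count are illustrative).
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2),
  order_date  DATE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- A filter on the partition column lets Hive scan only that partition:
SELECT SUM(amount) FROM sales WHERE country = 'IN';
```

Because the `WHERE` clause matches the partition column, Hive prunes every other country's partition instead of scanning the full table.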

    File format: we can choose a specialized file format to store data efficiently, such as the row-based AVRO format or the column-based ORC and Parquet formats. We can also apply compression codecs such as SNAPPY, GZIP, BZIP2, and LZO.
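As a brief sketch of combining a columnar format with compression (the table and column names are made up for illustration):

```sql
-- ORC storage with SNAPPY compression set via table properties:
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Parquet equivalent, compressed with GZIP:
CREATE TABLE events_parquet (
  event_id BIGINT,
  payload  STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "GZIP");
```

Columnar formats like ORC and Parquet let Hive read only the columns a query touches, and the codec cuts the bytes scanned from disk.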

2) Query level: 

    Query-level optimization needs to be considered while writing the Hive query itself. Join optimizations such as map-side join, bucket map join, and sort-merge bucket join fall under query (code) level optimization.
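The join optimizations above are mostly driven by session properties. A minimal sketch (the 25 MB small-table threshold shown is just an example value):

```sql
-- Let Hive convert a join into a map-side join when one table is
-- small enough to be loaded into memory:
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- example: ~25 MB

-- Enable bucket map join and sort-merge bucket (SMB) join; these apply
-- when both tables are bucketed (and sorted) on the join key:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```

With these set, joins between compatible bucketed tables can skip the expensive shuffle phase entirely.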

 Also, depending on the use case, we can use window functions as and when required to reduce query complexity.
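For instance, a window function can replace a self-join or a group-by-plus-join pattern. A sketch, assuming the illustrative `sales` table with `customer_id` and `amount` columns:

```sql
-- Largest order per customer in a single pass, no self-join:
SELECT order_id, customer_id, amount
FROM (
  SELECT order_id, customer_id, amount,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY amount DESC) AS rn
  FROM sales
) t
WHERE rn = 1;
```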

3) Execution level: 

    We can tweak existing properties to optimize a Hive job at execution time. For example, we can choose Spark or Tez as the execution engine instead of the default MapReduce (mr). We can also use vectorization, which processes records in batches of 1024 rows at a time instead of a single row each time.
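These execution-level switches are plain session properties. A minimal sketch (Tez is shown here, but `spark` works the same way if Hive-on-Spark is configured in your cluster):

```sql
-- Switch the execution engine from the default MapReduce to Tez:
SET hive.execution.engine = tez;

-- Enable vectorized execution (batches of 1024 rows); note that
-- vectorization works on columnar formats such as ORC:
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
```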

That covers the basic idea of Hive optimization. We will talk more about each of these techniques in coming posts!