Big Data - Hive Optimization

Optimization is key in any Big Data workload


Hive runs on top of Hadoop and is widely used in industry to run queries that analyze petabytes of data (big data). Performance therefore needs to be considered in any Hive development, since these queries ultimately transform and process huge amounts of data for analytics.

There are several optimization techniques we can use to improve Hive performance.

Hive optimization can broadly be classified into three major categories.

1) Table Design level (Structure level)

2) Query Level 

3) Execution Level

1) Table Design level (Structure level):

    Design-level optimization is what we consider while designing and defining the structure of tables. A few very important concepts, such as partitioning, bucketing, specialized file formats, and various compression techniques, fall under this category.

    Partitioning and bucketing are ways of dividing data into smaller parts so we end up scanning less data. The main idea is selecting the right column for partitioning or bucketing to make it efficient.

We should go for partitioning when a column has a small number of distinct values, like country, state, or city, while bucketing suits columns with many distinct values, like an id column.

We can also use multi-level partitioning, or combine both partitioning and bucketing in a single table.
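The ideas above can be sketched in HiveQL. This is a minimal example; the table and column names (`sales`, `customer_id`, `country`, etc.) and the bucket count are illustrative assumptions, not a prescription:

```sql
-- Partition on low-cardinality 'country', bucket on high-cardinality
-- 'customer_id' (names and bucket count are illustrative).
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2),
  order_date  DATE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- A filter on the partition column lets Hive scan only that partition:
SELECT SUM(amount) FROM sales WHERE country = 'IN';
```

Because the `WHERE` clause matches the partition column, Hive prunes every other country's partition instead of scanning the full table.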

    File format: we can choose a specialized file format to store data efficiently, such as the row-based AVRO format or the column-based ORC and Parquet formats. We can also apply compression codecs such as SNAPPY, GZIP, BZIP2, and LZO.
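As a brief sketch of combining a columnar format with compression (the table and column names are made up for illustration):

```sql
-- ORC storage with SNAPPY compression set via table properties:
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Parquet equivalent, compressed with GZIP:
CREATE TABLE events_parquet (
  event_id BIGINT,
  payload  STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "GZIP");
```

Columnar formats like ORC and Parquet let Hive read only the columns a query touches, and the codec cuts the bytes scanned from disk.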

2) Query level: 

    Query-level optimization needs to be considered while writing the Hive query itself. Join optimizations such as map-side join, bucket map join, and sort-merge bucket join fall under query (code) level optimization.
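The join optimizations above are mostly driven by session properties. A minimal sketch (the 25 MB small-table threshold shown is just an example value):

```sql
-- Let Hive convert a join into a map-side join when one table is
-- small enough to be loaded into memory:
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- example: ~25 MB

-- Enable bucket map join and sort-merge bucket (SMB) join; these apply
-- when both tables are bucketed (and sorted) on the join key:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```

With these set, joins between compatible bucketed tables can skip the expensive shuffle phase entirely.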

 Also, depending on the use case, we can use window functions as and when required to reduce query complexity.
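For instance, a window function can replace a self-join or a group-by-plus-join pattern. A sketch, assuming the illustrative `sales` table with `customer_id` and `amount` columns:

```sql
-- Largest order per customer in a single pass, no self-join:
SELECT order_id, customer_id, amount
FROM (
  SELECT order_id, customer_id, amount,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY amount DESC) AS rn
  FROM sales
) t
WHERE rn = 1;
```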

3) Execution level: 

    We can tweak existing properties to optimize a Hive job at execution time. For example, we can choose Spark or Tez as the execution engine instead of the default MapReduce (mr). We can also use vectorization, which processes records in batches of 1024 rows at a time instead of a single row each time.
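These execution-level switches are plain session properties. A minimal sketch (Tez is shown here, but `spark` works the same way if Hive-on-Spark is configured in your cluster):

```sql
-- Switch the execution engine from the default MapReduce to Tez:
SET hive.execution.engine = tez;

-- Enable vectorized execution (batches of 1024 rows); note that
-- vectorization works on columnar formats such as ORC:
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
```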

That covers the basic idea of Hive optimization. We will talk more about each of these techniques in coming posts!