Big Data - Hive design level optimization - Partitioning (Static & Dynamic )

Hive design level optimization

Blog

We use hive to query and analyze data set available on HDFS or AWS S3. While reading the specific set of data from TB's or PB's data reading should be efficient and partitioning is the best way to make this read efficient and make significant performance gain on hive.

Partition is the one of the major optimization technique which we need to consider at the time of table design.

Partitioning can be done on columns with low cardinality - it means we can go for partition when we have less no of distinct value in particular column.

Partition works by dividing the data into smaller logical segments. Separate record in manageable part based on column values so we finally scan less data while querying.

Suppose we are creating customer table which is having entry for customers spread across globe we can divide our data based on the country column.

table_name/partition_name/file

customer/country=india/India_data

customer/country=usa/usa_data

customer/country=canada/canada_data

If we want to query data for specific country like India hive will search directly to india partition and end up scanning on one partition out of many.

There is 2 type of partition available in Hive

1) Static Partition

2) Dynamic Partition

1) Static Partition :

Static partition is manual way where we have to mention partition name in which we are loading data . we should be well aware of data in advance. In most of the industry use cases we usually not much aware of data in advance hence static partition is not frequently used however its fast as compared to dynamic way of partition as we are manually loading the data in specified partition.Also static partition is the default partition in hive.

We will discuss one simple use case to understand where we are creating table customer which is having data across the globe and choose country as partition column (we have less no of distinct value in country column and its best fit for partition column for this table)

Step 1: Create partition table

Create table customer (cust_id int, cust_name string,cust_contact long)

partitioned by country string

row format delimited fields terminated by ',';

Step 2: Load data into partition table by load command with partition name specified in the load command

Load data local inpath /local_path/data/file_india into table customer partition (country=india)

Load data local inpath /local_path/data/file_usa into table customer partition (country=usa)

Load data local inpath /local_path/data/file_canada into table customer partition (country=canada)

Validation :

Show partition customer;

country=india

country=usa

country=canada

Partition table folder structure:

user/hive/warehouse/db_name/customer/country=india/file_india

user/hive/warehouse/db_name/customer/country=usa/file_usa

user/hive/warehouse/db_name/customer/country=canada/file_cananda

Query on partition table:

If we query now for data specific to any country india or usa it will scan for only that partition in table.

Select * from customer where country='india'

2) Dynamic partition:

Dynamic partition is the way to partition table at runtime dynamically - without having any manual innervation data automatically loaded to respective partition based on partition column value.

This is most commonly used partitioning technique in hive.

As default partition is static in hive we have to set certain parameter to make dynamic partition enabled in hive.

set.hive.exec.dynamic.partition=true ;

set hive.exec.dynamic.partiion.mode=nonstrict ;

While creating dynamic we can not load directly data in partition table with load commands instead we have to follow certain steps where we need to insert data from one table to another.

=>Create stage table(Non-partition table) and load data into it with load command

=>Create partition table

=>Insert data from Non-partition table to Partition table with Insert into command

We use similar example of customer table here for dynamic partition.

1) Create Non partition table -stage table

Create table customer_stage (cust_id int, cust_name string,cust_contact long,country string)

row format delimited fields terminated by ',' ;

2) Load data to Non Partitioned table with normal load commands

Load data local inpath /local_path/data/file1 into table customer;

Load data local inpath /local_path/data/file2 into table customer ;

Load data local inpath /local_path/data/file3 into table customer ;

3) Create Partition table:

Create table customer (cust_id int, cust_name string,cust_contact long)

partitioned by country string

row format delimited fields terminated by ',';

4) Load data from Non Partition -stage table to partition table:

Insert into customer partition(country) select * from customer_stage ;

Validation :

Show partition customer ;

country=india

country=usa

country=canada

Partition table folder structure: Folder structure will be similar as we have same value in partition

user/hive/warehouse/db_name/customer/country=india/file_india

user/hive/warehouse/db_name/customer/country=usa/file_usa

user/hive/warehouse/db_name/customer/country=canada/file_cananda

Partition table with more than 1 column :

Multi level partitioning can also possible in hive where we can do partition at more than one layer. Here we can achieve more level of optimization as we are dividing data further in more parts

Here we will do multi level partition for same customer table which is having column like country , state, city

1) Create Non partition table -stage table

Create table customer_stage (cust_id int, cust_name string,cust_contact long,country string,state string,city string)

row format delimited fields terminated by ',' ;

2) Load data to Non Partitioned table with normal load commands

Load data local inpath /local_path/data/file1 into table customer;

Load data local inpath /local_path/data/file2 into table customer ;

Load data local inpath /local_path/data/file3 into table customer ;

3) Create Partition table

Create table customer (cust_id int, cust_name string,cust_contact long)

partitioned by (country string,state string, city string)

row format delimited fields terminated by ',';

4) Load data from Non Partition -stage table to partition table

Insert into customer partition(country,state.city) select * from customer_stage ;

Validation :

Show partition customer ;

country=india/state=maharashtra/city=pune

country=india/state=maharashtra/city=mumbai

country=india/state=mp/city=indore

country=india/state=mp/city=bhopal

country=usa/state=houston/city=texas

country=usa/state=chicago/city=illinois

Partition table folder structure: Folder structure will be similar as we have same value in partition

user/hive/warehouse/db_name/customer/country=india/state=maharashtra/city=pune/file1

user/hive/warehouse/db_name/customer/country=india/state=maharashtra/city=mumbai/file1

Query on partition table:

If we query now for data specific to any country india or usa. It will scan for only that partition in table.

Select * from customer where country='india' and city='pune'

Now if we run query to search data for pune or perform any query or analysis on pune data it will scan directly on pune folder inside india partition which again give us more level of optimization and faster reads as compare to search all country data.

Apart from partitioning we have few more design level optimization technique like Bucketing , File format and compression techniques which we will discuss in more detail on coming post.

Follow my blog for more Big data and data engineering post.

Blog Name: bigdatabygourav.blogspot.com

Table of Contents

Gourav Nagar

Associate Consultant - Big Data Engineer

Tata Consultancy Services

14 yrs of Exp. at

Tata Consultancy Services|LTI - Larsen & Toubro Infotech

" 🚀 Unlock Your Big Data Potential: Join My Mentorship Journey! 🚀 Are you eager to conquer Big Data Engineering and harness the power of data for innovation and s...read more

2x Sessions

Unlimited Chat

Job Referrals

Big Data

Spark

SQL

Hadoop

ETL

EMR

Data Warehousing

AWS

₹10,000

Per Month

5.0(20+)

View Profile

For:

Experienced Professional

Targeting Domains:

Data Engineer / Big ..

Get started by booking a free trial session with the mentor of your choice.

[email protected]

Preplaced Education Private Limited

Ibblur Village, Bangalore - 560103

GSTIN- 29AAKCP9555E1ZV