Hive design level optimization
Blog
We use hive to query and analyze data set available on HDFS or AWS S3. While reading the specific set of data from TB's or PB's data reading should be efficient and partitioning is the best way to make this read efficient and make significant performance gain on hive.
Partition is the one of the major optimization technique which we need to consider at the time of table design.
Partitioning can be done on columns with low cardinality - it means we can go for partition when we have less no of distinct value in particular column.
Partition works by dividing the data into smaller logical segments. Separate record in manageable part based on column values so we finally scan less data while querying.
Suppose we are creating customer table which is having entry for customers spread across globe we can divide our data based on the country column.
table_name/partition_name/file
customer/country=india/India_data
customer/country=usa/usa_data
customer/country=canada/canada_data
If we want to query data for specific country like India hive will search directly to india partition and end up scanning on one partition out of many.
There is 2 type of partition available in Hive
1) Static Partition
2) Dynamic Partition
1) Static Partition :
Static partition is manual way where we have to mention partition name in which we are loading data . we should be well aware of data in advance. In most of the industry use cases we usually not much aware of data in advance hence static partition is not frequently used however its fast as compared to dynamic way of partition as we are manually loading the data in specified partition.Also static partition is the default partition in hive.
We will discuss one simple use case to understand where we are creating table customer which is having data across the globe and choose country as partition column (we have less no of distinct value in country column and its best fit for partition column for this table)
Step 1: Create partition table
Create table customer (cust_id int, cust_name string,cust_contact long)
partitioned by country string
row format delimited fields terminated by ',';
Step 2: Load data into partition table by load command with partition name specified in the load command
Load data local inpath /local_path/data/file_india into table customer partition (country=india)
Load data local inpath /local_path/data/file_usa into table customer partition (country=usa)
Load data local inpath /local_path/data/file_canada into table customer partition (country=canada)
Validation :
Show partition customer;
country=india
country=usa
country=canada
Partition table folder structure:
user/hive/warehouse/db_name/customer/country=india/file_india
user/hive/warehouse/db_name/customer/country=usa/file_usa
user/hive/warehouse/db_name/customer/country=canada/file_cananda
Query on partition table:
If we query now for data specific to any country india or usa it will scan for only that partition in table.
Select * from customer where country='india'
2) Dynamic partition:
Dynamic partition is the way to partition table at runtime dynamically - without having any manual innervation data automatically loaded to respective partition based on partition column value.
This is most commonly used partitioning technique in hive.
As default partition is static in hive we have to set certain parameter to make dynamic partition enabled in hive.
set.hive.exec.dynamic.partition=true ;
set hive.exec.dynamic.partiion.mode=nonstrict ;
While creating dynamic we can not load directly data in partition table with load commands instead we have to follow certain steps where we need to insert data from one table to another.
=>Create stage table(Non-partition table) and load data into it with load command
=>Create partition table
=>Insert data from Non-partition table to Partition table with Insert into command
We use similar example of customer table here for dynamic partition.
1) Create Non partition table -stage table
Create table customer_stage (cust_id int, cust_name string,cust_contact long,country string)
row format delimited fields terminated by ',' ;
2) Load data to Non Partitioned table with normal load commands
Load data local inpath /local_path/data/file1 into table customer;
Load data local inpath /local_path/data/file2 into table customer ;
Load data local inpath /local_path/data/file3 into table customer ;
3) Create Partition table:
Create table customer (cust_id int, cust_name string,cust_contact long)
partitioned by country string
row format delimited fields terminated by ',';
4) Load data from Non Partition -stage table to partition table:
Insert into customer partition(country) select * from customer_stage ;
Validation :
Show partition customer ;
country=india
country=usa
country=canada
Partition table folder structure: Folder structure will be similar as we have same value in partition
user/hive/warehouse/db_name/customer/country=india/file_india
user/hive/warehouse/db_name/customer/country=usa/file_usa
user/hive/warehouse/db_name/customer/country=canada/file_cananda
Partition table with more than 1 column :
Multi level partitioning can also possible in hive where we can do partition at more than one layer. Here we can achieve more level of optimization as we are dividing data further in more parts
Here we will do multi level partition for same customer table which is having column like country , state, city
1) Create Non partition table -stage table
Create table customer_stage (cust_id int, cust_name string,cust_contact long,country string,state string,city string)
row format delimited fields terminated by ',' ;
2) Load data to Non Partitioned table with normal load commands
Load data local inpath /local_path/data/file1 into table customer;
Load data local inpath /local_path/data/file2 into table customer ;
Load data local inpath /local_path/data/file3 into table customer ;
3) Create Partition table
Create table customer (cust_id int, cust_name string,cust_contact long)
partitioned by (country string,state string, city string)
row format delimited fields terminated by ',';
4) Load data from Non Partition -stage table to partition table
Insert into customer partition(country,state.city) select * from customer_stage ;
Validation :
Show partition customer ;
country=india/state=maharashtra/city=pune
country=india/state=maharashtra/city=mumbai
country=india/state=mp/city=indore
country=india/state=mp/city=bhopal
country=usa/state=houston/city=texas
country=usa/state=chicago/city=illinois
Partition table folder structure: Folder structure will be similar as we have same value in partition
user/hive/warehouse/db_name/customer/country=india/state=maharashtra/city=pune/file1
user/hive/warehouse/db_name/customer/country=india/state=maharashtra/city=mumbai/file1
Query on partition table:
If we query now for data specific to any country india or usa. It will scan for only that partition in table.
Select * from customer where country='india' and city='pune'
Now if we run query to search data for pune or perform any query or analysis on pune data it will scan directly on pune folder inside india partition which again give us more level of optimization and faster reads as compare to search all country data.
Apart from partitioning we have few more design level optimization technique like Bucketing , File format and compression techniques which we will discuss in more detail on coming post.
Follow my blog for more Big data and data engineering post.
Blog Name: bigdatabygourav.blogspot.com
Copyright Β©2024 Preplaced.in
Preplaced Education Private Limited
Ibblur Village, Bangalore - 560103
GSTIN- 29AAKCP9555E1ZV