Hive Introduction
Hive is an Apache project providing data warehousing infrastructure. It offers fault tolerance and scalability for data storage. Its design allows the use of a query language to access distributed data stored as files. Hive is not designed for online transaction processing (OLTP); it is best suited for traditional data warehousing tasks (OLAP).
Hive on the cluster
- Hive clients: JDBC/ODBC drivers
- Hive Services: CLI, HiveServer2, Metastore, File System, Execution Engine
- Hive storage and compute: metastore database, Hadoop cluster
Hive Services
Beeline
beeline is a command-line interface to Hive that works in embedded mode (like the regular CLI) or by connecting to a HiveServer2 process using JDBC. Hive can be accessed via the command line using beeline from an SSH session.
beeline -u "jdbc:hive2://zoo-1.au.adaltas.cloud:2181,zoo-2.au.adaltas.cloud:2181,zoo-3.au.adaltas.cloud:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;" --showDbInPrompt=true
It will prompt you with the following:
Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z>
You can see the Hive version (3.1.0) and the JDBC URL used to connect to Hive via ZooKeeper.
There are 3 ZooKeeper servers on the cluster:
- zoo-1.au.adaltas.cloud:2181
- zoo-2.au.adaltas.cloud:2181
- zoo-3.au.adaltas.cloud:2181
Hive Metastore
Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables and partitions in a relational database.
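As an illustration, once a table exists (for instance the external table created later in this guide), you can inspect the metadata the Metastore holds for it directly from Beeline:
-- Shows the columns and types, the owner, the table type (managed or external),
-- the storage format and the HDFS location recorded in the Metastore.
DESCRIBE FORMATTED moviedb.${username}_ratings_ext;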
Create a database
The syntax is the following: CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
create database if not exists moviedb;
Show the list of databases:
show databases;
Managing Tables
A Hive table is logically made up of the data and its associated metadata. The data resides in HDFS or in any Hadoop filesystem, including the local filesystem or S3. The metadata is stored in a relational database.
Select a database
Except for string comparisons, Hive is generally case insensitive.
use moviedb;
Create a table
When you create a table in Hive, by default Hive moves the data into its warehouse directory and the resulting table is called a managed table. A managed table is stored under the path given by the hive.metastore.warehouse.dir property, by default in a folder similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden with the LOCATION clause during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted.
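For instance, you can check where the managed tables of the moviedb database created above will be stored:
-- Shows the database description, including its location
-- under the warehouse directory.
DESCRIBE DATABASE moviedb;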
Alternatively, you may create an external table, in which case Hive refers to data that lives at an existing location outside the warehouse directory.
An external table describes a schema but does not manage the data. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh the metadata information.
Use external tables when files are already present or in remote locations, and when the files should remain even if the table is dropped.
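A minimal sketch of such a refresh, assuming a partitioned external table whose partition directories were added directly on HDFS (the table name here is a placeholder):
-- Scan the table location and register any partition directories
-- the Metastore does not know about yet.
MSCK REPAIR TABLE moviedb.my_partitioned_ext_table;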
External table
We have the following text file ratings.txt, which contains a user id, a movie id, a rating, a date, and a country:
cat > ratings.txt <<EOL
1,44,5,2020-10-01,FR
2,21,1,2020-03-01,FR
3,35,4,2020-03-01,US
4,44,4,2020-10-01,US
1,21,2,2020-10-01,FR
EOL
Let's put the data on HDFS:
hdfs dfs -mkdir -p /user/$USER/data
hdfs dfs -put ratings.txt /user/$USER/data/ratings.txt
Create an external table
SET hivevar:username=YOUR_USERNAME;
CREATE EXTERNAL TABLE moviedb.${username}_ratings_ext ( userid INT, movieid INT,
ratings INT, dt STRING, country STRING)
COMMENT 'Movie ratings'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/${username}/data';
If the file has a header, use the option:
TBLPROPERTIES ('skip.header.line.count'='1');
List tables
show tables;
Check the structure of a table
describe moviedb.${username}_ratings_ext;
Select data
select * from moviedb.${username}_ratings_ext;
Managed table
Now, let’s create a table that is managed by Hive with the following command:
CREATE TABLE IF NOT EXISTS moviedb.${username}_ratings ( userid INT, movieid INT,
ratings INT, dt STRING, country STRING)
COMMENT 'data about movie ratings'
STORED AS ORC;
Insert data into the managed table
Fill the managed table with the external table's data:
INSERT OVERWRITE TABLE moviedb.${username}_ratings SELECT * FROM moviedb.${username}_ratings_ext;
Check the results
SELECT * FROM moviedb.${username}_ratings;
Partitions and Buckets
- Partitions
A partition is a way of dividing a table into related parts based on the values of particular columns, to permit more efficient queries (see the pruning example at the end of this subsection). A table may be partitioned in multiple dimensions.
CREATE TABLE moviedb.${username}_ratings_part( userid INT, movieid INT, ratings INT, dt STRING)
PARTITIONED BY (country STRING);
For dynamic partitioning, we have to set this property:
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE moviedb.${username}_ratings_part PARTITION(country)
SELECT * FROM moviedb.${username}_ratings;
SHOW PARTITIONS moviedb.${username}_ratings_part;
At the filesystem level, you can check:
hdfs dfs -ls /warehouse/tablespace/managed/hive/moviedb.db/${USER}_ratings_part
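Queries that filter on the partition column only read the matching partition directories. For example, with the table created above:
-- Only the country=FR partition directory is read,
-- instead of scanning the whole table.
SELECT * FROM moviedb.${username}_ratings_part WHERE country='FR';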
- Buckets
You can organize your tables or partitions into buckets for more efficient queries or for sampling purposes, as illustrated after the example below.
CREATE TABLE moviedb.${username}_ratings_bucket(userid INT, movieid INT,
ratings INT, dt STRING, country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
INSERT OVERWRITE TABLE moviedb.${username}_ratings_bucket
SELECT * FROM moviedb.${username}_ratings;
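Bucketing makes sampling cheap: Hive can read a subset of buckets instead of the whole table. A minimal sketch, assuming the bucketed table created above:
-- Read only the rows that hash into the first of the 4 buckets.
SELECT * FROM moviedb.${username}_ratings_bucket
TABLESAMPLE (BUCKET 1 OUT OF 4 ON country) t;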