Hive Introduction
Hive is an Apache project providing data warehousing infrastructure. It offers fault tolerance and scalability for data storage. Its design allows the use of a query language to access distributed data stored as files. Hive is not designed for online transaction processing (OLTP); it is best suited for traditional data warehousing tasks (OLAP).
Hive on the cluster
- Hive clients: JDBC/ODBC drivers
- Hive Services: CLI, HiveServer2, Metastore, File System, Execution Engine
- Hive storage and compute: metastore database, Hadoop cluster
Hive Services
Beeline
beeline is a command-line interface to Hive that works in embedded mode (like the regular CLI) or by connecting to a HiveServer2 process using JDBC. Hive can be accessed via the command line using beeline from an SSH session.
beeline -u "jdbc:hive2://zoo-1.au.adaltas.cloud:2181,zoo-2.au.adaltas.cloud:2181,zoo-3.au.adaltas.cloud:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;" --showDbInPrompt=true
It will prompt you with the following:
Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z>
You can see the Hive version (3.1.0) and the JDBC URL used to connect to Hive via ZooKeeper.
There are 3 ZooKeeper servers on the cluster:
- zoo-1.au.adaltas.cloud:2181
- zoo-2.au.adaltas.cloud:2181
- zoo-3.au.adaltas.cloud:2181
Hive Metastore
Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables and partitions in a relational database.
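As an illustration, once a table exists (for instance the external table created later in this guide), you can inspect the metadata the Metastore holds for it directly from Beeline:
-- Shows the columns and types, the owner, the table type (managed or external),
-- the storage format and the HDFS location recorded in the Metastore.
DESCRIBE FORMATTED moviedb.${username}_ratings_ext;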
Create a database
The syntax is the following: CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
create database if not exists moviedb;
Show the list of databases:
show databases;
Managing Tables
A Hive table is logically made up of the data and its associated metadata. The data resides in HDFS or in any Hadoop filesystem, including the local filesystem or S3. The metadata is stored in a relational database.
Select a database
Except for string comparisons, Hive is generally case insensitive.
use moviedb;
Create a table
When you create a table in Hive, by default Hive moves the data into its warehouse directory and the resulting table is called a managed table. A managed table is stored under the path given by the hive.metastore.warehouse.dir property, by default in a folder similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden with the LOCATION clause during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted.
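For instance, you can check where the managed tables of the moviedb database created above will be stored:
-- Shows the database description, including its location
-- under the warehouse directory.
DESCRIBE DATABASE moviedb;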
Alternatively, you may create an external table, in which case Hive refers to data that lives at an existing location outside the warehouse directory.
An external table describes a schema but does not manage the data. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh the metadata information.
Use external tables when files are already present or in remote locations, and when the files should remain even if the table is dropped.
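A minimal sketch of such a refresh, assuming a partitioned external table whose partition directories were added directly on HDFS (the table name here is a placeholder):
-- Scan the table location and register any partition directories
-- the Metastore does not know about yet.
MSCK REPAIR TABLE moviedb.my_partitioned_ext_table;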
External table
We have the following text file ratings.txt, which contains a user id, a movie id, a rating, a date, and a country:
cat > ratings.txt <<EOL
1,44,5,2020-10-01,FR
2,21,1,2020-03-01,FR
3,35,4,2020-03-01,US
4,44,4,2020-10-01,US
1,21,2,2020-10-01,FR
EOL
Let's put the data on HDFS:
hdfs dfs -mkdir -p /user/$USER/data
hdfs dfs -put ratings.txt /user/$USER/data/ratings.txt
Create an external table
SET hivevar:username=YOUR_USERNAME;
CREATE EXTERNAL TABLE moviedb.${username}_ratings_ext ( userid INT, movieid INT,
ratings INT, dt STRING, country STRING)
COMMENT 'Movie ratings'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/${username}/data';
If the file has a header, use the option:
TBLPROPERTIES ('skip.header.line.count'='1');
List tables
show tables;
Check the structure of a table
describe moviedb.${username}_ratings_ext;
Select data
select * from moviedb.${username}_ratings_ext;
Managed table
Now, let’s create a table that is managed by Hive with the following command:
CREATE TABLE IF NOT EXISTS moviedb.${username}_ratings ( userid INT, movieid INT,
ratings INT, dt STRING, country STRING)
COMMENT 'data about movie ratings'
STORED AS ORC;
Insert data into the managed table
Fill the managed table with the external table's data:
INSERT OVERWRITE TABLE moviedb.${username}_ratings SELECT * FROM moviedb.${username}_ratings_ext;
Check the results
SELECT * FROM moviedb.${username}_ratings;
Partitions and Buckets
- Partitions
A partition is a way of dividing a table into related parts based on the values of particular columns, to permit more efficient queries (see the pruning example at the end of this subsection). A table may be partitioned in multiple dimensions.
CREATE TABLE moviedb.${username}_ratings_part( userid INT, movieid INT, ratings INT, dt STRING)
PARTITIONED BY (country STRING);
For dynamic partitioning, we have to set this property:
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE moviedb.${username}_ratings_part PARTITION(country)
SELECT * FROM moviedb.${username}_ratings;
SHOW PARTITIONS moviedb.${username}_ratings_part;
At the filesystem level, you can check:
hdfs dfs -ls /warehouse/tablespace/managed/hive/moviedb.db/${USER}_ratings_part
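Queries that filter on the partition column only read the matching partition directories. For example, with the table created above:
-- Only the country=FR partition directory is read,
-- instead of scanning the whole table.
SELECT * FROM moviedb.${username}_ratings_part WHERE country='FR';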
- Buckets
You can organize your tables or partitions into buckets for more efficient queries or for sampling purposes, as illustrated after the example below.
CREATE TABLE moviedb.${username}_ratings_bucket(userid INT, movieid INT,
ratings INT, dt STRING, country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
INSERT OVERWRITE TABLE moviedb.${username}_ratings_bucket
SELECT * FROM moviedb.${username}_ratings;
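Bucketing makes sampling cheap: Hive can read a subset of buckets instead of the whole table. A minimal sketch, assuming the bucketed table created above:
-- Read only the rows that hash into the first of the 4 buckets.
SELECT * FROM moviedb.${username}_ratings_bucket
TABLESAMPLE (BUCKET 1 OUT OF 4 ON country) t;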