Lab: hive
Goals
- Create an external table on top of data stored as CSV
- Create a managed ORC table
- Load data from the CSV table to the ORC table (with some transformations)
Create an external table
For this lab we will be using a very small dataset of NYC taxi drivers.
Using the official Hive Data Definition Langage:
- Using the HDFS CLI, take a look at the data used for this lab at
/education/ece/big-data/2020/fall/bda/resources/lab4/nyc_drivers/drivers.csv
- Copy the
nyc_drivers
folder to your user directory in HDFS:#export ece_group=gr1 hdfs dfs -mkdir -p "/user/$USER/labs" hdfs dfs -cp /education/ece/big-data/2020/fall/bda/resources/lab4/nyc_drivers "/user/$USER/labs/"
- Open a Beeline session by typing
beeline
- Create an external table targeting our data with this statement (to be completed, replace
YOUR_USERNAME
):SET hivevar:username=YOUR_USERNAME; CREATE EXTERNAL TABLE hive_labs.${username}_nyc_drivers_ext ( driver_id INT, -- COMPLETE HERE ) ROW FORMAT SERDE -- COMPLETE HERE STORED AS TEXTFILE LOCATION -- COMPLETE HERE TBLPROPERTIES ('skip.header.line.count'='1');
- Check that the table is correctly created by selecting all the data in it. If you see only
NULL
values, your schema is not correct.
Create a managed ORC table
Tip: to create a managed ORC table, you don’t have to specify a LOCATION
nor a SERDE
:
CREATE ...
STORED AS ORC;
- Create a managed ORC table (not external) that must have the same schema as the external table created above (
${username}_nyc_drivers_ext
) but with:- The
_ext
prefix removed from the name:${username}_nyc_drivers
- The column
name
devided intofirst_name
andlast_name
- The columne
location
renamed asaddress
(becauseLOCATION
is a Hive keyword) - The column
certified
as aBOOLEAN
- The
- Check that your table was created using the HDFS CLI at
/warehouse/tablespace/managed/hive/hive_labs.db/$USER_nyc_drivers
(should be empty)
Load data from the CSV table to the ORC table
Now we want to populate our ORC table from our CSV table. Using the Hive Data Manipulation Language:
- Write a statement to insert data to the ORC table by applying 2 transformations (check the available HiveQL string functions):
- Split
name
intofirst_name
andlast_name
- Transform
certified
fromSTRING
toBOOLEAN
- Rename
location
toaddress
- Split
- Execute your query
- Check what the data looks like in the managed table using the HDFS CLI at
/warehouse/tablespace/managed/hive/hive_labs.db/$USER_nyc_drivers