HBase introduction and shell usage
This page provides an introduction to Apache HBase, explains how to interact with the HBase shell, and presents a use case: importing a file from HDFS into HBase.
HBase definition
Apache HBase is a column-oriented non-relational database (NoSQL) that runs on top of the Hadoop Distributed File System (HDFS). It provides real-time, random, and consistent read and write access to tables containing billions of rows and millions of columns. HBase applications are written in Java. The database is highly fault tolerant: data is replicated across the servers of the data center, and automatic failover guarantees high availability.
Basic HBase commands
To run a secure HBase, the edge node of the Adaltas Cloud cluster is configured to use Kerberos authentication. Once connected to it via SSH, you can access the HBase shell by typing:
hbase shell
You should see:
hbase(main):001:0>
The shell provided by HBase allows you to communicate with HBase. Given below are some of the general commands supported by the HBase shell:
- exit: logs you out and exits the shell.
- help: gives you help about the main commands of the shell.
- status: lists the number of HBase servers used.
- version: gives the HBase version used.
- table_help: provides help for table-reference commands.
- whoami: provides information about the current user.
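As a minimal illustration, a first session could chain a few of these commands (outputs omitted, as they depend on your cluster):
hbase(main):001:0> whoami     # shows the authenticated Kerberos user
hbase(main):002:0> status     # summarizes active and dead region servers
hbase(main):003:0> version
hbase(main):004:0> exit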
Table management
Data storage mechanism in HBase
HBase is a column-oriented database where data is stored in tables sorted by row ID. A table has one or more column families, and each column family can have multiple columns (aka column qualifiers). Each cell of the table has a timestamp. To summarize, the storage mechanism in HBase is composed of:
- Table: collection of rows.
- Row: collection of column families.
- Column family: collection of columns.
- Column: collection of key value pairs.
- Cell: a tuple composed of a row, a column, and a version, which uniquely identifies a value in HBase.
- Timestamp: a combination of date and time, identifying the version of a cell.
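To make this schema concrete, here is a hypothetical shell session (the table and column names are made up for illustration) that stores two versions of the same cell and reads both back:
create 'users', {NAME => 'info', VERSIONS => 2}   # column family 'info' keeping 2 versions
put 'users', 'row1', 'info:city', 'Paris'
put 'users', 'row1', 'info:city', 'Lyon'          # same cell, newer timestamp
get 'users', 'row1', {COLUMN => 'info:city', VERSIONS => 2}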
Now that we have an overview of the table schema in HBase, let’s discover how to manipulate tables and data from the interactive shell.
Create a table
You can create a table with the create command, passing the table name and the column family:
hbase(main):001:0> create 'table_name', 'column_family'
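The create command also accepts several column families at once, for instance (illustrative names):
hbase(main):001:0> create 'employees', 'personal', 'professional'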
Display HBase tables
The list command shows all the tables created in HBase. You can check that your table has been created by typing:
hbase(main):001:0> list
You can get the description of your table using the describe command:
hbase(main):001:0> describe 'table_name'
Insert data in a table
The put command allows you to insert data into a table:
hbase(main):001:0> put 'table_name', 'rowID', 'column_family:column', 'value'
This command creates the cell corresponding to the specified rowID and column. If the cell already exists, its value is overwritten by the given one.
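For example, still with the hypothetical users table from above, successive puts populate several columns of the same row:
put 'users', 'row2', 'info:name', 'Alice'
put 'users', 'row2', 'info:age', '30'
get 'users', 'row2'   # displays both cells of the row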
Delete data in a table
You can use the delete command to delete a specific cell in a table:
hbase(main):001:0> delete 'table_name', 'rowID', 'column_family:column', 'timestamp'
To delete all the cells in a row you can use the deleteall command:
hbase(main):001:0> deleteall 'table_name', 'rowID'
Read data
The get command is used to read a single row of a table:
hbase(main):001:0> get 'table_name', 'rowID'
NB: You can use it to check if the put command worked properly.
You can also use the get command to read a specific column:
hbase(main):001:0> get 'table_name', 'rowID', {COLUMN => 'column_family:column'}
The scan command can be used to get all the data of a table:
hbase(main):001:0> scan 'table_name'
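On large tables a full scan is expensive; scan accepts options to restrict its output, for example limiting the number of returned rows:
hbase(main):001:0> scan 'table_name', {LIMIT => 10}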
Delete a table
To delete a table you need to disable it first:
hbase(main):001:0> disable 'table_name'
You can check if a table is enabled or not with the following command:
hbase(main):001:0> is_enabled 'table_name'
NB: If you try to delete an enabled table, an error will be displayed: ERROR: Table table_name is enabled. Disable it first.
Next, you can delete it with the drop command:
hbase(main):001:0> drop 'table_name'
You can check if the table has been deleted using the exists command:
hbase(main):001:0> exists 'table_name'
Namespace management
A namespace is used for logical grouping of tables in the HBase system. It serves resource management, security, and isolation. For example, a namespace can be created to group tables and to hand out specific permissions to users (i.e. allow a user to only read the data inside the tables).
NB: All users will be able to see namespaces and tables within namespaces, but not the data.
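As a sketch, assuming the HBase AccessController coprocessor is enabled and the names are illustrative, read-only access on a namespace can be granted with:
hbase(main):001:0> grant 'user1', 'R', '@namespace_name'   # 'R' = read, '@' targets a namespace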
Create a namespace
Using the create_namespace command you can create a new namespace:
hbase(main):001:0> create_namespace 'namespace_name'
Then, if you want to create an HBase table in that namespace:
hbase(main):001:0> create 'namespace_name:table_name','column_family'
If you do not specify any namespace, the table is created in the default namespace.
List tables created under a namespace
You can display all available namespaces with the list_namespace command:
hbase(main):001:0> list_namespace
It’s possible to display all the tables created in a given namespace using the list_namespace_tables command:
hbase(main):001:0> list_namespace_tables 'namespace_name'
Delete namespace
To delete a namespace, use the drop_namespace command:
hbase(main):001:0> drop_namespace 'namespace_name'
NB: It is only possible to drop an empty namespace: if you want to drop a namespace, you first need to drop all the tables created in it, as sketched below.
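A minimal sequence to empty a namespace and then drop it (names are illustrative):
disable 'namespace_name:table_name'
drop 'namespace_name:table_name'
drop_namespace 'namespace_name'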
Use case of a CSV file import and filter applications
The goal of this section is to import the Taxi Ride dataset in CSV format from HDFS into HBase and then apply filters on the HBase table.
Import CSV file from HDFS into HBase
Step 1: Data preparation
If our dataset is stored locally, we first need to put it into HDFS. With the command below, we copy the file from the local root/data directory to the /data/nycTaxiRides path in HDFS:
hadoop fs -copyFromLocal root/data/nycTaxiRides.csv /data/nycTaxiRides
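You can check that the file landed in HDFS with:
hadoop fs -ls /data/nycTaxiRides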
Step 2: HBase table creation
First, open the HBase shell by typing:
hbase shell
Using the HBase shell, create a new HBase table corresponding to our dataset:
hbase(main):001:0> create 'nyc-taxi', 'cf'
Check if the table has been created by typing:
hbase(main):001:0> list
Step 3: Import data into HBase
Leave the HBase shell by typing exit and run the ImportTsv command, which loads data into HBase. Once we submit it, a MapReduce job is started:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,cf:isStart,cf:endTime,cf:startTime,cf:startLon,cf:startLat,cf:endLon,cf:endLat,cf:passengerCnt,cf:taxiId,cf:driverId" nyc-taxi /data/nycTaxiRides/nycTaxiRides.csv
- -Dimporttsv.columns indicates the column names corresponding to our dataset.
- -Dimporttsv.separator specifies the delimiter of the CSV file.
Check the data in the HBase table using scan 'nyc-taxi'.
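Note that ImportTsv writes through regular puts by default. For very large files, it can instead generate HFiles to be bulk loaded in a second step; a sketch, where /tmp/nyc-taxi-hfiles is an HDFS output directory of our choosing (the bulk-load class name may vary across HBase versions):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,cf:isStart,cf:endTime,cf:startTime,cf:startLon,cf:startLat,cf:endLon,cf:endLat,cf:passengerCnt,cf:taxiId,cf:driverId" -Dimporttsv.bulk.output=/tmp/nyc-taxi-hfiles nyc-taxi /data/nycTaxiRides/nycTaxiRides.csv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/nyc-taxi-hfiles nyc-taxi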
Apply filters
By default, a Scan reads the entire table from start to end. It’s possible to limit the Scan result by using a Filter. HBase includes several filter types and also allows you to create your own custom filters. To use a filter in the HBase shell, it is necessary to import the corresponding classes. Given below are some of the filters:
- PrefixFilter: applies a filter to the row key.
- TimeStampsFilter: keeps only the cells with the specified timestamps.
- FamilyFilter or QualifierFilter: compares each column family or column qualifier with a specified comparator using a chosen compare operator.
Let’s choose a filter that allows finding all the drivers who start a ride: SingleColumnValueFilter.
Using the HBase shell, scan the table keeping only the driverId column:
scan 'nyc-taxi', {COLUMNS=>'cf:driverId'}
Then, keep only the rows where the column isStart is equal to “START”:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'nyc-taxi', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'),
Bytes.toBytes('isStart'), CompareFilter::CompareOp.valueOf('EQUAL'),
BinaryComparator.new(Bytes.toBytes('START'))), COLUMNS=>'cf:driverId'}
SingleColumnValueFilter takes the column family and the qualifier of the column to test. Then, CompareFilter::CompareOp.valueOf takes the compare operator, which can be for example EQUAL, GREATER_OR_EQUAL or LESS. Finally, BinaryComparator takes as parameter the value against which the column content is compared.
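Depending on your HBase version, the same filter can also be expressed with the shell’s filter language, which avoids the imports; a sketch:
scan 'nyc-taxi', {FILTER => "SingleColumnValueFilter('cf', 'isStart', =, 'binary:START')", COLUMNS => 'cf:driverId'}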