HBase introduction and shell usage
This page provides an introduction to Apache HBase, explains how to interact with the HBase shell, and presents a use case: importing a file from HDFS into HBase.
HBase definition
Apache HBase is a column-oriented non-relational database (NoSQL) that runs on top of the Hadoop Distributed File System (HDFS). It provides real-time, random, and consistent read and write access to tables containing billions of rows and millions of columns. HBase applications are written in Java. The database is highly fault tolerant: data is replicated across the servers of the data center, and automatic failover guarantees high availability.
Basic HBase commands
To run a secure HBase, the edge node of the Adaltas Cloud cluster is configured to use Kerberos authentication. Once connected to it via SSH, you can access the HBase shell by typing:
hbase shell
You should see:
hbase(main):001:0>
The shell provided by HBase allows you to communicate with HBase. Given below are some of the general commands supported by the HBase shell:
- exit: logs you out and exits the shell.
- help: gives you help about the main commands of the shell.
- status: lists the number of HBase servers used.
- version: gives the HBase version used.
- table_help: provides help for table-reference commands.
- whoami: provides information about the current user.
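As a minimal illustration, a first session could chain a few of these commands (outputs omitted, as they depend on your cluster):
hbase(main):001:0> whoami     # shows the authenticated Kerberos user
hbase(main):002:0> status     # summarizes active and dead region servers
hbase(main):003:0> version
hbase(main):004:0> exit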
Table management
Data storage mechanism in HBase
HBase is a column-oriented database where data is stored in tables sorted by row ID. A table has one or more column families, and each column family can have multiple columns (aka column qualifiers). Each cell of the table has a timestamp. To summarize, the storage mechanism in HBase is composed of:
- Table: collection of rows.
- Row: collection of column families.
- Column family: collection of columns.
- Column: collection of key value pairs.
- Cell: a tuple composed of a row, a column, and a version, which uniquely identifies a value in HBase.
- Timestamp: a combination of date and time, identifying the version of a cell.
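To make this schema concrete, here is a hypothetical shell session (the table and column names are made up for illustration) that stores two versions of the same cell and reads both back:
create 'users', {NAME => 'info', VERSIONS => 2}   # column family 'info' keeping 2 versions
put 'users', 'row1', 'info:city', 'Paris'
put 'users', 'row1', 'info:city', 'Lyon'          # same cell, newer timestamp
get 'users', 'row1', {COLUMN => 'info:city', VERSIONS => 2}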
Now that we have an overview of the table schema in HBase, let’s discover how to manipulate tables and data from the interactive shell.
Create a table
You can create a table with the create command, passing the table name and the column family:
hbase(main):001:0> create 'table_name', 'column_family'
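The create command also accepts several column families at once, for instance (illustrative names):
hbase(main):001:0> create 'employees', 'personal', 'professional'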
Display HBase tables
The list command shows all the tables created in HBase. You can check that your table has been created by typing:
hbase(main):001:0> list
You can get the description of your table using the describe command:
hbase(main):001:0> describe 'table_name'
Insert data in a table
The put command allows you to insert data into a table:
hbase(main):001:0> put 'table_name', 'rowID', 'column_family:column', 'value'
This command creates the cell corresponding to the specified rowID and column. If the cell already exists, its value is overwritten by the given one.
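For example, still with the hypothetical users table from above, successive puts populate several columns of the same row:
put 'users', 'row2', 'info:name', 'Alice'
put 'users', 'row2', 'info:age', '30'
get 'users', 'row2'   # displays both cells of the row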
Delete data in a table
You can use the delete command to delete a specific cell in a table:
hbase(main):001:0> delete 'table_name', 'rowID', 'column_family:column', 'timestamp'
To delete all the cells in a row you can use the deleteall command:
hbase(main):001:0> deleteall 'table_name', 'rowID'
Read data
The get command is used to read a single row of a table:
hbase(main):001:0> get 'table_name', 'rowID'
NB: You can use it to check if the put command worked properly.
You can also use the get command to read a specific column:
hbase(main):001:0> get 'table_name', 'rowID', {COLUMN => 'column_family:column'}
The scan command can be used to get all the data of a table:
hbase(main):001:0> scan 'table_name'
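On large tables a full scan is expensive; scan accepts options to restrict its output, for example limiting the number of returned rows:
hbase(main):001:0> scan 'table_name', {LIMIT => 10}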
Delete a table
To delete a table you need to disable it first:
hbase(main):001:0> disable 'table_name'
You can check if a table is enabled or not with the following command:
hbase(main):001:0> is_enabled 'table_name'
NB: If you try to delete an enabled table, an error will be displayed: ERROR: Table table_name is enabled. Disable it first.
Next, you can delete it with the drop command:
hbase(main):001:0> drop 'table_name'
You can check if the table has been deleted using the exists command:
hbase(main):001:0> exists 'table_name'
Namespace management
A namespace is used for logical grouping of tables in the HBase system. It serves resource management, security, and isolation. For example, a namespace can be created to group tables and to hand out specific permissions to users (i.e. allow a user to only read the data inside the tables).
NB: All users will be able to see namespaces and tables within namespaces, but not the data.
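As a sketch, assuming the HBase AccessController coprocessor is enabled and the names are illustrative, read-only access on a namespace can be granted with:
hbase(main):001:0> grant 'user1', 'R', '@namespace_name'   # 'R' = read, '@' targets a namespace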
Create a namespace
Using the create_namespace command you can create a new namespace:
hbase(main):001:0> create_namespace 'namespace_name'
Then, if you want to create an HBase table in that namespace:
hbase(main):001:0> create 'namespace_name:table_name','column_family'
If you do not specify any namespace, the table is created in the default namespace.
List tables created under a namespace
You can display all available namespaces with the list_namespace command:
hbase(main):001:0> list_namespace
It’s possible to display all the tables created in a given namespace using the list_namespace_tables command:
hbase(main):001:0> list_namespace_tables 'namespace_name'
Delete namespace
To delete a namespace, use the drop_namespace command:
hbase(main):001:0> drop_namespace 'namespace_name'
NB: It is only possible to drop an empty namespace: if you want to drop a namespace, you first need to drop all the tables created in it, as sketched below.
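A minimal sequence to empty a namespace and then drop it (names are illustrative):
disable 'namespace_name:table_name'
drop 'namespace_name:table_name'
drop_namespace 'namespace_name'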
Use case of a CSV file import and filter applications
The goal of this section is to import the Taxi Ride dataset in CSV format from HDFS into HBase and then apply filters on the HBase table.
Import CSV file from HDFS into HBase
Step 1: Data preparation
If our dataset is stored locally, we first need to put it into HDFS. With the command below, we copy the file from the local root/data directory to the /data/nycTaxiRides path in HDFS:
hadoop fs -copyFromLocal root/data/nycTaxiRides.csv /data/nycTaxiRides
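You can check that the file landed in HDFS with:
hadoop fs -ls /data/nycTaxiRides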
Step 2: HBase table creation
First, open the HBase shell by typing:
hbase shell
Using the HBase shell, create a new HBase table corresponding to our dataset:
hbase(main):001:0> create 'nyc-taxi', 'cf'
Check if the table has been created by typing:
hbase(main):001:0> list
Step 3: Import data into HBase
Leave the HBase shell by typing exit and run the ImportTsv command, which loads data into HBase. Once we submit it, a MapReduce job is started:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,cf:isStart,cf:endTime,cf:startTime,cf:startLon,cf:startLat,cf:endLon,cf:endLat,cf:passengerCnt,cf:taxiId,cf:driverId" nyc-taxi /data/nycTaxiRides/nycTaxiRides.csv
- -Dimporttsv.columns indicates the column names corresponding to our dataset.
- -Dimporttsv.separator specifies the delimiter of the CSV file.
Check the data in the HBase table using scan 'nyc-taxi'.
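Note that ImportTsv writes through regular puts by default. For very large files, it can instead generate HFiles to be bulk loaded in a second step; a sketch, where /tmp/nyc-taxi-hfiles is an HDFS output directory of our choosing (the bulk-load class name may vary across HBase versions):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,cf:isStart,cf:endTime,cf:startTime,cf:startLon,cf:startLat,cf:endLon,cf:endLat,cf:passengerCnt,cf:taxiId,cf:driverId" -Dimporttsv.bulk.output=/tmp/nyc-taxi-hfiles nyc-taxi /data/nycTaxiRides/nycTaxiRides.csv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/nyc-taxi-hfiles nyc-taxi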
Apply filters
By default, a Scan reads the entire table from start to end. It’s possible to limit the Scan result by using a Filter. HBase includes several filter types and also allows you to create your own custom filters. To use a filter in the HBase shell, it is necessary to import the corresponding classes. Given below are some of the filters:
- PrefixFilter: applies a filter to the row key.
- TimeStampsFilter: keeps only the cells with the specified timestamps.
- FamilyFilter or QualifierFilter: compares each column family or column qualifier with a specified comparator using a chosen compare operator.
Let’s choose a filter that allows finding all the drivers who start a ride: SingleColumnValueFilter.
Using the HBase shell, scan the table keeping only the driverId column:
scan 'nyc-taxi', {COLUMNS=>'cf:driverId'}
Then, keep only the rows where the column isStart is equal to “START”:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'nyc-taxi', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('cf'),
Bytes.toBytes('isStart'), CompareFilter::CompareOp.valueOf('EQUAL'),
BinaryComparator.new(Bytes.toBytes('START'))), COLUMNS=>'cf:driverId'}
SingleColumnValueFilter takes the column family and the qualifier of the column to test. Then, CompareFilter::CompareOp.valueOf takes the compare operator, which can be for example EQUAL, GREATER_OR_EQUAL or LESS. Finally, BinaryComparator takes as parameter the value against which the column content is compared.
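Depending on your HBase version, the same filter can also be expressed with the shell’s filter language, which avoids the imports; a sketch:
scan 'nyc-taxi', {FILTER => "SingleColumnValueFilter('cf', 'isStart', =, 'binary:START')", COLUMNS => 'cf:driverId'}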