Hadoop's HBase with Java API

As soon as we come across the term 'Big Data', the next word that comes to mind is 'Hadoop'. Let's begin with a brief description of what it is.
The Apache Hadoop software library is an open-source framework that allows distributed storage & distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each machine offering local processing & storage. Rather than relying on hardware for high availability, it detects & handles failures at the application layer, delivering a highly available service on top of a cluster of computers.
The base Apache Hadoop framework comprises the following modules:
Hadoop Common – contains libraries & utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters & using them for scheduling of users’ applications.
Hadoop MapReduce – a programming model for large scale data processing.
MapReduce is the heart of Hadoop which allows massive scalability across hundreds or thousands of servers in clusters.
Map/Reduce: The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input & combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
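To make the map and reduce steps concrete, here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. The WordCount class name and the command-line input/output paths are illustrative assumptions, not something from this post:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map job: break each input line into (word, 1) key/value pairs
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
  // Reduce job: combine the tuples emitted for each word into a single count
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}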
The Hadoop Distributed File System splits files into large blocks (64 MB or 128 MB by default) & distributes the blocks among the nodes in the cluster. For data processing, Hadoop's MapReduce (see above) delivers code in the form of jar files to the nodes that hold the required data, and those nodes then process the data in parallel.
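As a quick illustration of how a client hands data to HDFS, here is a small sketch using the org.apache.hadoop.fs.FileSystem API. The /user/demo/people.csv path is a made-up example, and the block size you see depends on your cluster's dfs.blocksize setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/people.csv");   // hypothetical path
    FSDataOutputStream out = fs.create(file);        // HDFS splits the file into blocks for us
    out.writeUTF("001,Smith,Joe,40000");
    out.close();

    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size = " + status.getBlockSize()); // e.g. 64 MB or 128 MB
    fs.close();
  }
}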
High-level architectural diagram

Layman's workflow diagram within the architecture

Image courtesy: Google Images

Now you must be wondering: with so many other cloud platforms available, why do we need Hadoop at all? Let's drill down further and answer this query:
1) Lowering production cost
2) Procuring large-scale resources quickly
3) Handling batch workloads efficiently
4) Handling variable resource requirements
5) Simplifying operations
ZooKeeper: The name sounds a bit weird, but it does exactly what its name suggests. ZooKeeper provides the infrastructure for node synchronization. An application can use it to ensure that tasks across the cluster are synchronized or serialized. It does this by maintaining status-type information in the memory of the ZooKeeper servers. Each ZooKeeper server keeps a copy of the state of the entire system and persists this information in local log files. A very large Hadoop cluster can be supported by multiple ZooKeeper servers (in this case, a master server synchronizes the top-level servers). Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information.
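As a rough sketch of that status-type information, the snippet below uses the plain ZooKeeper Java client to publish a node's state and read it back. The localhost:2181 connect string and the /demo/worker-1 path are assumptions for illustration only:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkStatusExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (connect string is an assumption)
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        System.out.println("event: " + event);   // session / watch notifications arrive here
      }
    });
    // Publish a small piece of status information under a znode
    if (zk.exists("/demo", false) == null) {
      zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    zk.create("/demo/worker-1", "READY".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    // Any other client can read (and watch) the same path to stay in sync
    byte[] data = zk.getData("/demo/worker-1", true, null);
    System.out.println("status = " + new String(data));
    zk.close();
  }
}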

With this much general information, we are all set to gear up towards our main topic, i.e. HBase.
What is HBase?
HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in Java much like a typical MapReduce application.
For those who are new to columnar databases, the example below could help you grasp the concept a little faster.
A database might have this table:

RowId  EmpId  Lastname  Firstname  Salary
001    10     Smith     Joe        40000
002    12     Jones     Mary       50000
003    11     Johnson   Cathy      44000
004    22     Jones     Bob        55000

Row-oriented serialized storage solution:

001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;

Column-oriented solution:
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:

10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
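If it helps to see the two layouts produced side by side, here is a tiny, self-contained Java sketch (the SerializationDemo class name is mine, not anything from HBase) that serializes the sample table both ways:

import java.util.Arrays;
import java.util.List;

public class SerializationDemo {
  // rowKey, EmpId, Lastname, Firstname, Salary for the sample table
  static final List<String[]> ROWS = Arrays.asList(
      new String[]{"001", "10", "Smith", "Joe", "40000"},
      new String[]{"002", "12", "Jones", "Mary", "50000"},
      new String[]{"003", "11", "Johnson", "Cathy", "44000"},
      new String[]{"004", "22", "Jones", "Bob", "55000"});

  public static void main(String[] args) {
    // Row-oriented: all values of one record, then the next record
    StringBuilder rowWise = new StringBuilder();
    for (String[] r : ROWS) {
      rowWise.append(r[0]).append(':')
             .append(String.join(",", r[1], r[2], r[3], r[4])).append(';');
    }
    // Column-oriented: all values of one column, then the next column
    StringBuilder colWise = new StringBuilder();
    for (int col = 1; col <= 4; col++) {
      for (String[] r : ROWS) {
        colWise.append(r[col]).append(':').append(r[0]).append(',');
      }
      colWise.setLength(colWise.length() - 1);  // drop the trailing comma
      colWise.append(';');
    }
    System.out.println(rowWise);   // matches the row-oriented string above
    System.out.println(colWise);   // matches the column-oriented string above
  }
}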

But that doesn't answer why a columnar solution is better than a row-based architecture.
When we consider database operations with respect to hardware, disk seek time is the major concern, since it directly affects efficiency.
Column-oriented databases are faster at certain analytical operations. Queries involving joins on large datasets that don't fit in memory will be significantly faster. Conversely, column-oriented databases are significantly slower at handling transactions. The advantages and disadvantages of column-oriented databases only really become apparent at some degree of scale, either in terms of data size or transaction volume. In practice they are most useful for data-warehousing applications on "big" relational data. This is why online transaction processing (OLTP)-focused RDBMS systems are more row-oriented, while online analytical processing (OLAP)-focused systems are a balance of row-oriented and column-oriented.
The benefits of a columnar database come into play when we talk about data compression, since columnar data can be highly compressed. The compression permits columnar operations like MIN, MAX, AVG, SUM, COUNT, etc. to be performed very rapidly. Another benefit is that because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.
As the use of in-memory analytics increases, the relative benefits of row-oriented vs. column-oriented databases may become less important. In-memory analytics is not concerned with efficiently reading or writing data to a hard disk; instead, it allows data to be queried in random access memory.

HBase Architecture

Image courtesy: Google Images

HBase is modeled after Google's BigTable. HBase tables are partitioned into multiple regions, with each region storing a range of the table's rows, and multiple regions are assigned by the master to a region server, i.e. ranges of rows from a single table can be stored on different region servers.
HBase has the concept of tables, but they are not relational tables, nor does HBase support typical RDBMS concepts such as joins, indexes, ACID transactions, etc. Even though you give these features up, you automatically & transparently gain scalability & fault tolerance. HBase can be described as a key-value store with automatic data versioning.
In HBase, if there is no data for a given column family, it simply does not store anything at all. Contrast this with a relational database, which must store null values explicitly. In addition, when retrieving data in HBase, you should only ask for the specific column families you need; because there can literally be millions of columns in a given row, you need to make sure you ask only for the data you actually need.
HBase utilizes ZooKeeper (a distributed coordination service) to manage region assignments to region servers, and to recover from region server crashes by loading the crashed region server’s regions onto other functioning region servers.
Regions contain an in-memory data store (MemStore) & a persistent data store (HFile), & all regions on a region server share a reference to the write-ahead log (WAL) which is used to store new data that hasn’t yet been persisted to permanent storage & to recover from region server crashes. Each region holds a specific range of row keys & when a region exceeds a configurable size, HBase automatically splits the region into two child regions, which is the key to scaling HBase.
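For reference, the split threshold mentioned above is controlled by the hbase.hregion.max.filesize property (normally set in hbase-site.xml). The hedged sketch below shows it being set programmatically and overridden per table; the 10 GB figure is purely illustrative:

Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024); // split regions at roughly 10 GB

HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("people"));
tableDescriptor.setMaxFileSize(10L * 1024 * 1024 * 1024);             // per-table override of the same limit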
As a table grows, more and more regions are created and spread across the entire cluster. When clients request a specific row key or scan a range of row keys, HBase tells them the regions on which those keys exist, and the clients then communicate directly with the region servers where those regions exist. This design minimizes the number of disk seeks required to find any given row, and optimizes HBase toward disk transfer when returning data. This is in contrast to relational databases, which might need to do a large number of disk seeks before transferring data from disk, even with indexes.
Clients interact with HBase via one of several available APIs, including a native Java API as well as a REST-based interface and several RPC interfaces (Apache Thrift, Apache Avro). You can also use DSLs to access HBase from Groovy, Jython, and Scala.
Time now to get into implementation using native java API :
The client APIs provide both DDL (data definition language) and DML (data manipulation language) semantics very much like what you find in SQL for relational databases. Suppose we are going to store information about people in HBase, and we want to start by creating a new table. The following snippet shows how to create a new table using the HBaseAdmin class.

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("people"));
tableDescriptor.addFamily(new HColumnDescriptor("personal"));
tableDescriptor.addFamily(new HColumnDescriptor("contactinfo"));
tableDescriptor.addFamily(new HColumnDescriptor("creditcard"));
admin.createTable(tableDescriptor);

The people table defined in the preceding listing contains three column families: personal, contactinfo, and creditcard. To create a table you create an HTableDescriptor and add one or more column families by adding HColumnDescriptor objects. You then call createTable to create the table. Now we have a table, so let's add some data. The next snippet shows how to use the Put class to insert data on John Doe, specifically his name and email address (omitting proper error handling for brevity).

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "people");
Put put = new Put(Bytes.toBytes("doe-john-m-12345"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("givenName"), Bytes.toBytes("John"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("mi"), Bytes.toBytes("M"));
put.add(Bytes.toBytes("personal"), Bytes.toBytes("surname"), Bytes.toBytes("Doe"));
put.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"), Bytes.toBytes("john.m.doe@gmail.com"));
table.put(put);
table.flushCommits();
table.close();

In the above snippet we instantiate a Put providing the unique row key to the constructor. We then add values, which must include the column family, column qualifier, and the value all as byte arrays. As you probably noticed, the HBase API's utility Bytes class is used a lot; it provides methods to convert to and from byte[] for primitive types and strings. (Adding a static import for the toBytes() method would cut out a lot of boilerplate code.) We then put the data into the table, flush the commits to ensure locally buffered changes take effect, and finally close the table. Updating data is also done via the Put class in exactly the same manner as just shown in the prior listing. Unlike relational databases in which updates must update entire rows even if only one column changed, if you only need to update a single column then that's all you specify in the Put and HBase will only update that column.
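For instance, continuing with the table instance from the previous snippet, a single-column update could look like this (the new email address is, of course, made up):

Put update = new Put(Bytes.toBytes("doe-john-m-12345"));
// Only the contactinfo:email cell is written; the rest of the row is left untouched
update.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"), Bytes.toBytes("john.doe@example.com"));
table.put(update);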
Retrieving the row we just created is accomplished using the Get class, as shown in the next snippet.

Get get = new Get(Bytes.toBytes("doe-john-m-12345"));
get.addFamily(Bytes.toBytes("personal"));
get.setMaxVersions(3);
Result result = table.get(get);

The code in the previous snippet instantiates a Get instance supplying the row key we want to find. Next we use addFamily to instruct HBase that we only need data from the personal column family, which also cuts down the amount of work HBase must do when reading information from disk. We also specify that we’d like up to three versions of each column in our result, perhaps so we can list historical values of each column. Finally, calling get returns a Result instance which can then be used to inspect all the column values returned.
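As a small follow-on sketch (continuing from the snippet above, with the java.util.Map and java.util.NavigableMap imports omitted to match the fragment style of the other snippets), this is one way to pull the latest value of a column out of the Result and to walk the returned versions via the nested map it exposes:

// Latest value of personal:givenName
byte[] givenName = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
System.out.println("givenName = " + Bytes.toString(givenName));

// All returned versions, keyed by timestamp (family -> qualifier -> timestamp -> value)
NavigableMap<Long, byte[]> versions =
    result.getMap().get(Bytes.toBytes("personal")).get(Bytes.toBytes("givenName"));
for (Map.Entry<Long, byte[]> entry : versions.entrySet()) {
  System.out.println(entry.getKey() + " -> " + Bytes.toString(entry.getValue()));
}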
In many cases you need to find more than one row. Using the Scan class, you can specify various options, such as the start and end row keys to scan, which columns and column families to include, and the maximum number of versions to retrieve. You can also add filters, which allow you to implement custom filtering logic to further restrict which rows and columns are returned. A common use case for filters is pagination. For example, we might want to scan through all people whose last name is Smith one page (e.g. 25 people) at a time. The next snippet shows how to perform a basic scan.

Scan scan = new Scan(Bytes.toBytes("smith-"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
scan.setFilter(new PageFilter(25));
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
// …
}

In the above snippet we create a new Scan that starts from the row key smith- and then use addColumn to restrict the columns returned (thus reducing the amount of disk transfer HBase must perform) to personal:givenName and contactinfo:email. A PageFilter is set on the scan to limit the number of rows scanned to 25. (An alternative to using the page filter would be to specify a stop row key when constructing the Scan.) We then get a ResultScanner for the Scan just created, and loop through the results performing whatever actions are necessary.
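Here is a hedged sketch of that alternative, using explicit start and stop row keys instead of the PageFilter. The stop row is exclusive, and "smith." works here only because '.' sorts immediately after '-' in byte order:

Scan boundedScan = new Scan(Bytes.toBytes("smith-"), Bytes.toBytes("smith."));  // scans only keys beginning with "smith-"
boundedScan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
boundedScan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
ResultScanner boundedScanner = table.getScanner(boundedScan);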
You can also delete data in HBase using the Delete class, which is analogous to the Put class. You can delete all the columns in a row (thus deleting the row itself), delete column families, delete individual columns, or some combination of those.
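A minimal sketch of those options against the same row we created earlier (the method names are the older-style ones matching the API used throughout this post):

Delete delete = new Delete(Bytes.toBytes("doe-john-m-12345"));
delete.deleteColumns(Bytes.toBytes("contactinfo"), Bytes.toBytes("email")); // all versions of one column
delete.deleteFamily(Bytes.toBytes("creditcard"));                          // a whole column family
table.delete(delete);

// Deleting the row itself is just a Delete carrying only the row key
table.delete(new Delete(Bytes.toBytes("doe-john-m-12345")));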

Connection Handling:
HBase provides the HConnection class, which provides functionality similar to connection-pool classes for sharing connections; for example, you use the getTable() method to get a reference to an HTable instance. The HConnectionManager class is how you get instances of HConnection. Similar to avoiding network round trips in web applications, effectively managing the number of RPCs and the amount of data returned is important when using HBase, and is something to consider when writing HBase applications.
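Here is a hedged sketch of that pattern using HConnectionManager (the connection-sharing API of the HBase releases this post targets; newer releases replace it with ConnectionFactory):

Configuration conf = HBaseConfiguration.create();
HConnection connection = HConnectionManager.createConnection(conf);
HTableInterface people = connection.getTable("people");
try {
    Result result = people.get(new Get(Bytes.toBytes("doe-john-m-12345")));
    // ... work with the result ...
} finally {
    people.close();        // releases the table back to the shared connection
    connection.close();    // close the shared connection when the application shuts down
}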

Stay tuned for the second part.
