
I'm new to HBase. I'm facing a problem when bulk loading data from a text file into HBase. Assume I have the following table:

Key_id | f1:c1 | f2:c2
row1   |  'a'  |  'b'
row1   |  'x'  |  'y'
  1. When I parse the two records and put them into HBase at the same time (same timestamp), only the version {row1 'x' 'y'} is kept. Here is the explanation:

When you put data into HBase, a timestamp is required. The timestamp can be generated automatically by the RegionServer or can be supplied by you. The timestamp must be unique per version of a given cell, because the timestamp identifies the version. To modify a previous version of a cell, for instance, you would issue a Put with a different value for the data itself, but the same timestamp.

I'm thinking about specifying the timestamps myself, but I don't know how to set timestamps automatically when bulk loading, and does it affect loading performance? I need the fastest and safest import process for big data.

  2. I tried to parse and put each record into the table one by one, but the speed is very slow... So another question is: how many records (or how much data) should go into a batch before putting into HBase? (I wrote a simple Java program to do the puts. It is much slower than using the ImportTsv tool to import. I don't know exactly what batch size that tool uses...)

Many thanks for your advice!

Ram Ghadiyaram
Bing Farm

2 Answers


Q1: HBase maintains versions using timestamps. If you don't provide one, it will take the default supplied by the HBase system (the RegionServer's current time).

In the put request you can supply a custom timestamp as well, if you have such a requirement. It does not affect performance.
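As a hedged sketch of that (using the same pre-1.0 client API as the code further down; the row key and column names just mirror the question's example table), an explicit timestamp can be attached to each cell in a Put:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampedPutSketch {
    public static Put buildPut(long ts) {
        // Both cells carry one explicit timestamp, so re-issuing this Put
        // with the same ts overwrites that version instead of adding one.
        Put put = new Put(Bytes.toBytes("row1"), ts);
        put.add(Bytes.toBytes("f1"), Bytes.toBytes("c1"), ts, Bytes.toBytes("a"));
        put.add(Bytes.toBytes("f2"), Bytes.toBytes("c2"), ts, Bytes.toBytes("b"));
        return put;
    }
}
```

To keep both of the question's records, give the second Put a different timestamp (e.g. `ts + 1`).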

Q2: You can do it in two ways.

  • A simple Java client with the batching technique shown below.

  • MapReduce importtsv (a batch client).

Ex: #1 Simple Java client with batching.

I used HBase puts in batched List objects of 100,000 records while parsing JSON (similar to your standalone CSV client).

Below is the code snippet through which I achieved this. The same thing can be done while parsing other formats as well.

You may need to call this method in two places:

1) with each full batch of 100,000 records;

2) for processing the remainder, when fewer than 100,000 records are left.

  public void addRecord(final List<Put> puts, final String tableName) throws Exception {
        HTable table = null;
        try {
            table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
            table.put(puts); // one buffered batch write instead of one RPC per record
            LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
        } catch (final Throwable e) {
            LOG.error("Batch put failed for table " + tableName, e);
            throw new Exception(e);
        } finally {
            LOG.info("Processed ---> " + puts.size());
            puts.clear();          // reuse the same list for the next batch
            if (table != null) {
                table.close();     // release client resources
            }
        }
    }
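The two call sites described above (full batches plus the final remainder) can be sketched with a small generic helper; `toBatches` is a made-up name for illustration, not part of the HBase API:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Splits records into flush-sized chunks; call addRecord(...) once per
    // chunk. The last chunk naturally holds the remainder (< batchSize).
    public static <T> List<List<T>> toBatches(List<T> records, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                records.subList(i, Math.min(i + batchSize, records.size()))));
        }
        return batches;
    }
}
```

With a batch size of 100,000, a 250,000-record input yields two full batches and one 50,000-record remainder, each passed to addRecord in turn.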

Note: the batch size is internally controlled by hbase.client.write.buffer, set like below in one of your config XMLs:

<property>
         <name>hbase.client.write.buffer</name>
         <value>20971520</value> <!-- 20971520 bytes = 20 MB -->
 </property>

Here it is set to 20 MB (the HBase default is 2 MB). Once the buffer is filled, it flushes all buffered puts to actually insert into your table.

Furthermore, whether you use the MapReduce client or a standalone client with the batching technique, batching is controlled by the buffer property above.
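A hedged sketch of tuning that buffer from code rather than hbase-site.xml, using the same pre-1.0 HTable API as the snippet above (the table name is illustrative):

```java
// Buffer puts client-side and size the write buffer per table explicitly.
HTable table = new HTable(HBaseConnection.getHBaseConfiguration(), "my_table");
table.setAutoFlush(false);                    // accumulate puts in the client buffer
table.setWriteBufferSize(20L * 1024 * 1024);  // 20971520 bytes = 20 MB
// ... table.put(...) calls fill the buffer and auto-flush when it is full ...
table.flushCommits();                         // force-flush any buffered remainder
table.close();
```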

Ram Ghadiyaram

If you need to overwrite records, you can configure the HBase table (its column family) to keep only one version.
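A hedged sketch of that configuration with the same era's admin descriptors (the table and family names are illustrative; the equivalent shell command would be `alter 'mytable', NAME => 'f1', VERSIONS => 1`):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class SingleVersionTableSketch {
    public static HTableDescriptor describe() {
        HColumnDescriptor family = new HColumnDescriptor("f1");
        family.setMaxVersions(1);   // keep only the newest version of each cell
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
        table.addFamily(family);
        return table;
    }
}
```

Passing this descriptor to Admin.createTable (or altering an existing family the same way) makes each new put replace the previous value instead of stacking versions.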

This page explains how to do bulk loading into HBase at the maximum possible speed:

How to use hbase bulk loading and why

miroB
  • Thx for your reply. As per the requirement, from a user key I need to collect all the information for that user. So I designed the table with user_id as the row key. The column family contains the recharge information for this user, so it holds a lot of recharge history and must have many versions. I set up Hadoop and HBase both in standalone mode, because I only have one server, and I load the data file from the local file system... So, any ideas for me? – Bing Farm Jul 29 '16 at 03:36
  • I still do not understand what you need: do you want to update an existing database, or initialize an empty one? One idea is to use a different technology for such a setup, for example MySQL. Another idea is not to use HBase timestamps for tracking history, but to encode the timestamp into the column name, or to combine multiple records in one column. – miroB Jul 29 '16 at 03:54