
I am reading a JSON file of about 30 MB and processing it to build column families and key/values. For each entry I create a Put object, set the row key and values into it, and add it to a list; when the list reaches 50,000 entries I call Table.batch() with it, clear the list, and move on to the next batch. However, processing the whole file, which ends up with 800,000 entries, takes 300 secs. I also tried table.put but it was even slower. I am using HBase 1.1 and getting the JSON from Kafka. Any suggestions to improve performance are appreciated. I checked SO forums but did not find much help. I will share code if you want to have a look at it.

Regards

Raghavendra

public static void processData(String jsonData)
{
    if (jsonData == null || jsonData.isEmpty())
    {
        System.out.println("JSON data is null or empty. Nothing to process");
        return;
    }

    long startTime = System.currentTimeMillis();

    Table table = null;
    try
    {
        table = HBaseConfigUtil.getInstance().getConnection().getTable(TableName.valueOf("MYTABLE"));
    }
    catch (IOException e1)
    {
        System.out.println("Could not get table: " + e1);
        //Cannot proceed without a table; returning also avoids a NullPointerException below
        return;
    }

    Put processData = null;
    List<Put> bulkData = new ArrayList<Put>();

    try
    {

        //Read the json and generate the model into a class    
        //ProcessExecutions is List<ProcessExecution>
        ProcessExecutions peData = JsonToColumnData.gson.fromJson(jsonData, ProcessExecutions.class);

        if (peData != null)
        {
            //Read the data and pass it to Hbase
            for (ProcessExecution pe : peData.processExecutions)
            {
                //Class Header stores some header information
                Header headerData = pe.getHeader();   

                String rowKey = headerData.getRowKey();
                processData = new Put(Bytes.toBytes(rowKey));
                processData.addColumn(Bytes.toBytes("Data"),
                                Bytes.toBytes("Time"),
                                Bytes.toBytes("value"));

                //Add to list
                bulkData.add(processData);            
                if (bulkData.size() >= 50000) //hardcoded for demo
                {
                    //The results array must be the same size as the action list
                    Object[] results = new Object[bulkData.size()];
                    table.batch(bulkData, results);
                    bulkData.clear();
                }
            } //end for
            //Complete the remaining write operation
            if (bulkData.size() > 0)
            {
                Object[] results = new Object[bulkData.size()];
                table.batch(bulkData, results);
                bulkData.clear();
            }
        }
    }
    catch (Exception e)
    {
        System.out.println(e);
        e.printStackTrace();
    }
    finally
    {
        try
        {
            table.close();
        }
        catch (IOException e)
        {
            System.out.println("Error closing table " + e);
            e.printStackTrace();
        }
    }

}


//This function is added here to show the connection
 /*public Connection getConnection()
{

    try
    {
        if (this.connection == null)
        {
            ExecutorService executor = Executors.newFixedThreadPool(HBaseConfigUtil.THREADCOUNT);
            this.connection = ConnectionFactory.createConnection(this.getHBaseConfiguration(), executor);
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
        System.out.println("Error in getting connection " + e.getMessage());
    }

    return this.connection;
}*/
  • pls share the code snippet. – Ram Ghadiyaram Jan 30 '17 at 13:56
  • ideally table.batch also works in the similar way as mentioned below. it should also work. – Ram Ghadiyaram Jan 30 '17 at 14:08
  • @RamGhadiyaram, thanks for posting your comment. I read your answer in the other question but that did not help me. Sharing my code in few moments – AnswerSeeker Jan 30 '17 at 14:24
  • have you tried to increase buffer size to 4 mb rather than 2 mb default? – Ram Ghadiyaram Jan 30 '17 at 15:07
  • @RamGhadiyaram I have changed it to 40 MB (41943040) based on your link shared in your answer. From the code if I try setWriteBufferSize, I get a warning saying it is deprecated. Instead I have added hbase.client.write.buffer to the site.xml though it was not present. Will test it and let you know. – AnswerSeeker Jan 30 '17 at 15:16
  • @RamGhadiyaram, the time is reduced is by half!! You are awesome. Now it takes 150 secs to process the same. Should I increase the client write buffer to 100 MB? What will be the impact? – AnswerSeeker Jan 30 '17 at 15:32
  • Replaced by BufferedMutator and BufferedMutatorParams.writeBufferSize(long) [see](https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html) However great.... if you are okay! you can accept answer as owner and pls care to vote it up – Ram Ghadiyaram Jan 30 '17 at 15:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/134393/discussion-between-rag-and-ram-ghadiyaram). – AnswerSeeker Jan 30 '17 at 15:42
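
As discussed in the comments above, the client write buffer can also be raised programmatically instead of through hbase-site.xml. A minimal sketch, assuming the rest of the connection setup from the question (the 40 MB value mirrors the one tried above and should be tuned for your workload):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class WriteBufferConfig {
    public static Connection createConnection() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        //Raise the client write buffer from the 2 MB default to 40 MB
        conf.setLong("hbase.client.write.buffer", 41943040L);
        return ConnectionFactory.createConnection(conf);
    }
}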

1 Answer


I had the same case, where I needed to parse a 5 GB JSON file and insert it into an HBase table. You can try the approach below (which should work); it proved very fast for batches of 100,000 records in my case.

public void addMultipleRecordsAtaShot(final ArrayList<Put> puts, final String tableName) throws Exception {
        HTable table = null;
        try {
            table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
            table.put(puts);
            LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
        } catch (final Throwable e) {
            e.printStackTrace();
        } finally {
            LOG.info("Processed ---> " + puts.size());
            puts.clear();
            if (table != null) {
                table.close(); //flushes the write buffer and releases resources
            }
        }
    }

For more details on increasing the buffer size, check my answer in a different context, and please refer to the docs: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html
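
Note that Table.setWriteBufferSize is deprecated in 1.1 and replaced by BufferedMutator with BufferedMutatorParams.writeBufferSize(long), as mentioned in the comments. A minimal sketch of that route, assuming an existing Connection and the MYTABLE table from the question (the buffer size is illustrative):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;

public class BufferedMutatorExample {
    public static void writeBatch(final Connection connection, final List<Put> puts) throws IOException {
        final BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("MYTABLE"))
                .writeBufferSize(41943040L); //40 MB; tune for your workload
        try (BufferedMutator mutator = connection.getBufferedMutator(params)) {
            //Mutations are buffered client-side and sent when the buffer fills
            mutator.mutate(puts);
            //Push whatever is still buffered before closing
            mutator.flush();
        }
    }
}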
