
I use HFileOutputFormat to bulk load CSV files into an HBase table. I have only a map task and no reduce task, with job.setNumReduceTasks(0). But I can see that a reducer runs in the job. Is this reducer started because of HFileOutputFormat?
Previously I was using TableOutputFormat for the same job, and a reducer never ran. I recently refactored the map task to use HFileOutputFormat, and after this change I can see a reducer running.

Secondly, I am getting the error below in the reducer, which I wasn't getting previously with TableOutputFormat. Is this also related to HFileOutputFormat?

Error: java.lang.ClassNotFoundException: com.google.common.base.Preconditions

RGC

2 Answers


HFileOutputFormat does indeed start a reduce task, which is necessary to produce HFiles: the output must be sorted before it can be written as HFiles.
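That reducer is wired in for you when the job is configured through HFileOutputFormat's helper method. A minimal driver sketch follows; the class name BulkLoadDriver, the mapper class CsvToPutMapper, and the table name "my_table" are placeholders for illustration, not names from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "csv-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);

        job.setMapperClass(CsvToPutMapper.class);            // your map-only CSV parser
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // This call plugs in the sorting reducer and a TotalOrderPartitioner,
        // and sets the number of reduce tasks to the table's region count,
        // overriding any earlier job.setNumReduceTasks(0).
        HTable table = new HTable(conf, "my_table");
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is why the reducer shows up even though the driver asked for zero reduce tasks: configureIncrementalLoad reconfigures the job after that setting.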

The error pops up because Hadoop needs Google's Guava library in order to produce HFiles. The easiest way to let Hadoop find this library is to copy it from $HBASE_HOME/lib/ to $HADOOP_HOME/lib/. Look for guava-<version>.jar.

Pieterjan
  • I am trying to optimize my MR job using your tips in this [post](http://stackoverflow.com/questions/8750764/what-is-the-fastest-way-to-bulk-load-data-into-hbase-programmatically). I have only a map task, which reads a CSV file and loads each line (record) into an HBase table. The performance has improved, but I still don't think it is efficient enough, as it takes around 10 minutes to load 3 million records. You mentioned that you were able to load 2.5M in a minute. I have pre-split the table regions. What else could I do to achieve maximum efficiency? Compressing the data? Please advise – RGC Apr 22 '13 at 10:44
  • I tried compressing the map output as well as the HFiles. That didn't show any improvement in performance. Please advise on what I could be missing, or what needs to be done to load millions of records within a minute. Note that I don't do any heavy processing other than forming a key and calling context.write(immutablerow, put). I noticed that the map completed in around 3-4 minutes, the reducer (invoked by HFileOutputFormat) took around 6-7 minutes, and completebulkload finished in a flash. – RGC Apr 22 '13 at 14:28

Yes, even if you set the number of reducers to zero, HFileOutputFormat initiates a reduce task to sort and merge the mapper output into a form the HTable can consume. The number of reducers is equal to the number of regions in the HBase table.

You can find sample code for preparing data for an HBase bulk load via a MapReduce job here.
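As a rough sketch of such a mapper (the column family "cf", the qualifier names, and the CSV layout are assumptions for illustration; adapt them to your schema):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses one CSV line per call and emits (row key, Put); the reducer that
// HFileOutputFormat adds then sorts these pairs into region-sized HFiles.
public class CsvToPutMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final byte[] CF = Bytes.toBytes("cf");   // assumed column family

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 2) {
            return;                                         // skip malformed records
        }
        byte[] rowKey = Bytes.toBytes(fields[0]);           // first column as row key
        Put put = new Put(rowKey);
        for (int i = 1; i < fields.length; i++) {
            put.add(CF, Bytes.toBytes("col" + i), Bytes.toBytes(fields[i]));
        }
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}
```

After the job finishes, the generated HFiles are moved into the table with the completebulkload tool, which is what makes this approach faster than writing Puts through TableOutputFormat.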

Prasad D