
I have to build a tool that will move our data from HBase (HFiles) into HDFS in Parquet format.

Please suggest the best way to move data from HBase tables to Parquet tables.

We have to move 400 million records from HBase to Parquet. How can this be achieved, and what is the fastest way to move the data?

Thanks in advance.

Regards,

Pardeep Sharma.

  • "Parquet" you mean to say parquet avro ? Hbase is schema less where as parquet avro file has schema. what do you want to do with this data in parquet ? If you are using binary/protobuf these are the complex data types may create some issues while creating parquet. please see my answer. – Ram Ghadiyaram May 04 '16 at 10:40
  • Yes, it's Parquet Avro. In our next step we'll use these Parquet files for testing. Thanks for your immediate reply. – Pardeep Sharma May 04 '16 at 11:50
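
As the comments note, HBase is schemaless while a Parquet Avro export needs a schema up front. For illustration, a minimal hypothetical Avro schema for such an export (a string row key plus nullable string columns; the field names are made up) could look like:

    {
      "type": "record",
      "name": "ExportedRow",
      "fields": [
        {"name": "rowKey", "type": "string"},
        {"name": "col1", "type": ["null", "string"], "default": null},
        {"name": "col2", "type": ["null", "string"], "default": null}
      ]
    }

The columns are declared nullable because HBase rows are sparse: any qualifier may be missing from a given row.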

2 Answers


I recently open-sourced a patch to HBase which tackles the problem you are describing. Have a look here: https://github.com/ibm-research-ireland/hbaquet

Yiannis Gkoufas

Please have a look at the project tmalaska/HBase-ToHDFS, which reads an HBase table and writes the output as Text, SequenceFile, Avro, or Parquet.

Example usage for Parquet, which exports the table data to Parquet files:

    hadoop jar HBaseToHDFS.jar ExportHBaseTableToParquet exportTest c export.parquet false avro.schema

Judging from the example, the arguments appear to be, in order: the HBase table name (exportTest), the column family (c), the output path (export.parquet), a compression flag (false), and the Avro schema file (avro.schema).
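
For a sense of what such an export involves under the hood, here is a minimal sketch (not the actual source of that project) of a map-only MapReduce job that scans an HBase table and writes each row out as an Avro GenericRecord through AvroParquetOutputFormat. The schema, the field names, and the assumption that every value is a string are hypothetical simplifications:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.avro.AvroParquetOutputFormat;

    public class HBaseToParquetSketch {

      // Hypothetical Avro schema: string row key, nullable string columns.
      static final String SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"ExportedRow\",\"fields\":["
          + "{\"name\":\"rowKey\",\"type\":\"string\"},"
          + "{\"name\":\"col1\",\"type\":[\"null\",\"string\"],\"default\":null},"
          + "{\"name\":\"col2\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

      public static class ExportMapper extends TableMapper<Void, GenericRecord> {
        private Schema schema;

        @Override
        protected void setup(Context context) {
          schema = new Schema.Parser().parse(SCHEMA_JSON);
        }

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
            throws IOException, InterruptedException {
          GenericRecord record = new GenericData.Record(schema);
          record.put("rowKey", Bytes.toString(rowKey.copyBytes()));
          // Copy every cell whose qualifier matches a schema field;
          // all values are decoded as strings for simplicity.
          for (Cell cell : result.rawCells()) {
            String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
            if (schema.getField(qualifier) != null && !"rowKey".equals(qualifier)) {
              record.put(qualifier, Bytes.toString(CellUtil.cloneValue(cell)));
            }
          }
          context.write(null, record);
        }
      }

      public static void main(String[] args) throws Exception {
        // args[0] = HBase table name, args[1] = output path (both hypothetical).
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-to-parquet-sketch");
        job.setJarByClass(HBaseToParquetSketch.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner batches for a bulk export
        scan.setCacheBlocks(false);  // do not evict hot data from the block cache

        TableMapReduceUtil.initTableMapperJob(
            args[0], scan, ExportMapper.class, null, null, job);

        job.setNumReduceTasks(0);    // map-only: a straight copy needs no shuffle
        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(SCHEMA_JSON));
        AvroParquetOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A map-only job avoids the shuffle entirely, which matters at 400 million records, and disabling block caching on the scan keeps the bulk read from evicting the region servers' working set.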
Ram Ghadiyaram