
I would like an easy way to load data into my HBase table, and I thought the ImportTsv tool would be ideal, with something like this:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=;' -Dimporttsv.columns=HBASE_ROW_KEY,f:i tab import.tsv

I want the values in the column "f:i" to be stored as bytes (shown as hex in the shell) and NOT as strings, because the direct consequence is that I am unable to query that column with filters that need to make integer comparisons.

1 - If I use a Put in the shell:

p = Put.new(Bytes.toBytes('r1'))
p.add(Bytes.toBytes('f'), Bytes.toBytes('i'), Bytes.toBytes(10))
tab.put(p)

I get:

r1  column=f:i, timestamp=1398519413393, value=\x00\x00\x00\x00\x00\x00\x00\x0A

2 - If I use the ImportTsv tool, I get:

r1  column=f:i, timestamp=1398519413393, value=10

But in this case, my scans with the following filter (as an example) won't work anymore:

f = SingleColumnValueFilter.new(
  Bytes.toBytes('f'),
  Bytes.toBytes('i'),
  CompareFilter::CompareOp::LESS_OR_EQUAL,
  BinaryComparator.new(Bytes.toBytes(70))
)

So basically, is there a simple way to fine-tune the ImportTsv tool so that it stores the numbers as in the first case?

Thanks a lot for your help!

tony

2 Answers


Tony, no luck. ImportTsv is the wrong tool for binary data. Actually, it is not a good tool at all.

It looks like you need a solution similar to what I do:

  • A MapReduce job that imports your data and writes out an HFile image.
  • The completebulkload tool to bulk-load the prepared HFiles.

Reference: https://hbase.apache.org/book/arch.bulk.load.html

More details:

  • For the import MapReduce job you actually only need a mapper. This mapper should produce a sequence of Put objects; look at importtsv itself for an example.
  • The rest of the import job is just configured with things like HFileOutputFormat2.configureIncrementalLoad(Job, HTable); a minimal sketch follows this list.
  • I recommend using HFile V2 for a number of reasons, starting with the lack of HFile V1 support in modern HBase clusters.
  • completebulkload is just a ready-to-use tool. I personally have a custom MapReduce job for this stage, because my tables use native things like Snappy and I don't want to install anything native on the client. So I just start a single mapper which takes the HFile image from HDFS and merges it into the specified table.
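
To make this concrete, here is a minimal sketch of such an importing job, assuming the layout from the question: separator ';', the row key in the first field and an integer in column f:i. The class names are invented for illustration and the calls target the HBase 0.96/0.98-era API (Put.add, HTable, HFileOutputFormat2.configureIncrementalLoad(Job, HTable)), so verify them against your cluster's version.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TsvToHFiles {

  // The only custom piece: a mapper that emits Put objects with the integer
  // column stored as its binary (long) encoding instead of its text form.
  public static class TsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final byte[] FAMILY = Bytes.toBytes("f");
    private static final byte[] QUALIFIER = Bytes.toBytes("i");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed line format: <rowkey>;<integer>
      String[] fields = line.toString().split(";");
      byte[] row = Bytes.toBytes(fields[0]);

      Put put = new Put(row);
      // Bytes.toBytes(long) yields the 8-byte value the shell Put example stores.
      put.add(FAMILY, QUALIFIER, Bytes.toBytes(Long.parseLong(fields[1].trim())));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "tsv-to-hfiles");
    job.setJarByClass(TsvToHFiles.class);
    job.setMapperClass(TsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input .tsv on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output dir

    // Wires in the partitioner, sort reducer and HFile output format so the job
    // writes HFiles aligned with the regions of the target table.
    HTable table = new HTable(conf, "tab");
    HFileOutputFormat2.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once the job has written the HFiles, completebulkload (org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles) moves them into the table's regions; that is the second bullet of the approach above.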

It looks somewhat complex, but it is indeed worth doing yourself. The benefit is MUCH more efficient ETL operations.

Roman Nikitchenko
  • Thanks Roman. I suppose you are right, but I wished there was a much easier solution :-) Temporarily, I parse and import my data file through the JRuby interface. It takes longer but is easier to implement :-) – tony Apr 29 '14 at 09:07
  • Depends on your requirements. I really need a robust solution, so I prefer to spend the time and get something that works. Currently I'm evaluating Spark for this class of tasks. Even more complex, but really powerful. BTW, where is my +1 ;-) ? – Roman Nikitchenko Apr 29 '14 at 12:19
  • One more note: https://dataddict.wordpress.com/2013/03/08/some-upcoming-features-in-hbase-0-96/comment-page-1/ so please be careful. HBase 0.96 actually drops HFile V1 off the market, so I'd prefer standard solutions for bulk loading. V3 could soon be the only option. – Roman Nikitchenko Apr 29 '14 at 12:20
  • We finally slightly modified the class 'TsvImporterMapper.java' to do what we need with integers. Indeed much, much faster than going through a layer of JRuby. You got your +1 ;-) – tony Apr 30 '14 at 11:42
  • But please check that you are using the HFile V2 infrastructure. V1 is dying (for Cloudera, for example, it is already dead). – Roman Nikitchenko Apr 30 '14 at 12:07
  • ;) Looks like the profit from this question will include another +1.5 for the 'answer' mark. Right now I'm investigating the possibility of doing what you do, but based on Scala. It makes sense because of Spark. So far so good, but somewhat complex due to the different language logic. – Roman Nikitchenko Apr 30 '14 at 12:10
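
For readers who take the route tony describes in his second comment above, modifying ImportTsv's own TsvImporterMapper.java, the change essentially comes down to converting the integer column's text to its binary encoding before it is added to the Put, instead of writing the raw text bytes. A hedged fragment of such a helper (illustrative only, not the actual TsvImporterMapper source):

// Turn the text of an integer column ("10") into the 8-byte long encoding,
// mirroring what Bytes.toBytes(10) produces in the shell example above.
static byte[] encodeIntegerColumn(byte[] textValue) {
    long n = Long.parseLong(Bytes.toString(textValue).trim());
    return Bytes.toBytes(n);  // "10" -> \x00\x00\x00\x00\x00\x00\x00\x0A
}

With values stored this way, a SingleColumnValueFilter using a BinaryComparator compares numerically for non-negative values, provided the comparator operand is encoded the same way (e.g. Bytes.toBytes(70L), 8 bytes).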

I had the same issue, and ended up writing a small bash script to convert the hexadecimal escape sequences in the tsv files into binary.

encode.sh

#!/bin/bash
# Transforms hexadecimal escape sequences, e.g. \xFF, into binary
# http://stackoverflow.com/questions/10929453/bash-scripting-read-file-line-by-line
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo -e "$line"    # -e makes echo interpret the \xNN escapes as raw bytes
done < "$1"

./encode.sh $TABLE.tsv | hadoop fs -put - $HDFS_PATH/$TABLE.tsv

Mikael Valot