
We are using HBase to store data that is Sqooped from Oracle to HDFS. We designed the row key as a byte array, framed as a composite key: (Md5(schemaName).getBytes() + Md5(date).getBytes() + ByteBuffer.allocate(8).putLong(pkid).array()), where the date format is yyyy-mm-dd and PKID is a long value.
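For reference, a composite key like the one described above can be assembled with plain JDK classes. This is only a minimal sketch of my understanding of the layout (the class and method names are my own, not from our codebase):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeyExample {
    // MD5(schema) 16B + MD5(yyyy-mm-dd date) 16B + pkid 8B = 40-byte row key
    static byte[] buildRowKey(String schemaName, String date, long pkid) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] schemaHash = md.digest(schemaName.getBytes(StandardCharsets.UTF_8));
            md.reset();
            byte[] dateHash = md.digest(date.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.allocate(16 + 16 + 8)
                    .put(schemaHash)   // 16 bytes
                    .put(dateHash)     // 16 bytes
                    .putLong(pkid)     // 8 bytes
                    .array();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        System.out.println(buildRowKey("MYSCHEMA", "2013-01-23", 12345L).length); // 40
    }
}
```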

If I want to get all the rows for a particular schema and a particular date, can I query the HBase table using a start row and end row, or is there any other way to run such a query?

When I store my row key as a string like user1_20130123, ..., user1_20130127, I am able to filter the table using:

scan 'TempTable', {
    COLUMNS => ['CF:NAME'],
    LIMIT => 10,
    STARTROW => 'user1_20100101',
    ENDROW => 'user1_20100115'
}

Here I get the rows for user1 within those dates. How can I query when the row key is stored as the byte array described above?

– Jørgen R
– GHK

1 Answer


You have a problem with your row keys: if you hash the date, you won't be able to use it as a start/stop row for your scans.

Your rowkeys should be something like this:

[16B_schema_MD5_hash][8B_long_timestamp][8B_pkid]

Which you can query like this:

// Requires the HBase client library
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Start/stop rows are [schema_hash][timestamp] prefixes, so the scan
// covers exactly one schema between the two timestamps.
Scan myScan = new Scan(
    Bytes.add(Bytes.toBytes(schemaNameMD5Hash), Bytes.toBytes(startTimestamp)),
    Bytes.add(Bytes.toBytes(schemaNameMD5Hash), Bytes.toBytes(stopTimestamp))
);
– Rubén Moraleda
  • Ok, I got it. So if I store the row key as [16B_schema_MD5_hash][yyyymmdd][8B_pkid], then I will be able to query; I tested the same with the row key "user1_20130101". Correct me if I am wrong. – GHK Jan 24 '14 at 05:23
  • Sure, you can do it that way because dates in that particular format sort correctly. Just keep in mind that you'll get the same 32-byte keys (yyyymmdd is an 8-byte string) with a lot less precision (days instead of milliseconds). – Rubén Moraleda Jan 24 '14 at 07:32
  • Will there be any performance issue when we store the row key as a combination of hash values, strings, and longs (byte arrays converted from strings and longs)? That is, will querying records from the HBase table suffer with this row key design? – GHK Jan 28 '14 at 07:18
  • 1
    No, it won't, HBase rowkeys are just Byte Arrays, it doesn't matter what kind of data are you using to generate them. Just try to get them as small as possible (i.e: if there are not many types of "schemas", don't use a full 16B MD5 hash, just use the first half of 8B) – Rubén Moraleda Jan 28 '14 at 08:19
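The size-saving suggestion above (keep only a prefix of the hash when few distinct schemas exist) can be sketched with plain JDK calls; the method name here is illustrative, not from any codebase:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class ShortHashExample {
    // Returns only the first `prefixLen` bytes of the MD5 hash;
    // a short prefix is enough to distinguish a small set of schemas.
    static byte[] shortSchemaHash(String schemaName, int prefixLen) throws Exception {
        byte[] full = MessageDigest.getInstance("MD5")
                .digest(schemaName.getBytes(StandardCharsets.UTF_8));
        return Arrays.copyOf(full, prefixLen); // truncate 16B -> prefixLen B
    }

    public static void main(String[] args) throws Exception {
        System.out.println(shortSchemaHash("alpha70", 8).length); // 8
    }
}
```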
  • Yes, we are using only a few bytes (5) of it. – GHK Jan 28 '14 at 08:57
  • With this row key design we are trying to pre-split the table. For the row key we are using a bucketing concept (16 buckets in total) for the first byte, and the rest is for the schema, date, and pkid values, for a total row key length of 26 bytes. I used the following notation for table pre-splitting, but when I describe the table I am not getting what I expected. Can you help me with this? – GHK Feb 11 '14 at 05:52
  • HBase shell script : create 'MYTRRANSDATA', {NAME => 'CF', COMPRESSION => 'SNAPPY', VERSIONS => 1,SPLITS => ['00000000000000000000000000','10000000000000000000000000', '20000000000000000000000000','30000000000000000000000000', '40000000000000000000000000','50000000000000000000000000', '60000000000000000000000000','70000000000000000000000000', '80000000000000000000000000','90000000000000000000000000', 'a0000000000000000000000000','b0000000000000000000000000', 'c0000000000000000000000000','d0000000000000000000000000', 'e0000000000000000000000000','f0000000000000000000000000']} – GHK Feb 11 '14 at 05:54
  • You've got a problem in your query, splits are not set per family. Try this instead: `create 'MYTRRANSDATA', {NAME => 'CF', COMPRESSION => 'SNAPPY', VERSIONS => 1}, {SPLITS => ['1', '2','3', '4','5', '6','7', '8','9', 'a','b', 'c','d', 'e','f']}` – Rubén Moraleda Feb 21 '14 at 07:06
  • Yes, I found the solution on some other sites, but I am still not sure if I can pass my splits like that, because my row key is 26 bytes long, with combinations. So I am not sure whether splits pre-defined like this will fit my data based on buckets: SPLITS => ['00000000000000000000000000','10000000000000000000000000', '20000000000000000000000000','30000000000000000000000000', ... ] Or is there any other way to define binary split ranges? – GHK Feb 21 '14 at 13:37
  • You don't need to set the full split key; just use the first char like in the example I wrote: anything starting with an "a" will be sent to the "a" region :) – Rubén Moraleda Feb 21 '14 at 15:34
  • Oh, is it? I was not aware of this. I will try it this way and come back here with the outcome. – GHK Feb 24 '14 at 05:07
  • I have created a table with pre-splits as you suggested, but I can see all the writes going to only one region, and because of this the Sqoop job is taking more time to load data into the HBase table. I am not sure where it is going wrong. Sqoop imports multiple records from Oracle into the HBase table, so records should be going to different regions, but here all of them go to only one region. I am attaching a screenshot for your reference: https://www.hightail.com/download/elNLcmxURnd5UkVPd3NUQw – GHK Feb 24 '14 at 09:15
  • Your splits are fine. Are you completely sure that the first byte of your row key is a 1-byte char? If you're using a 4-byte integer instead, your first byte would be \x00 and the row would be sent to the first region every time. It would help if you post some row keys (use the HBase shell). – Rubén Moraleda Feb 24 '14 at 19:20
  • The row key is generated as: one-byte bucket (BTRID % 16) + last nine bytes of MD5(schema) + date string (yyyymmdd) + BTRID as an 8-byte array. Example row keys for schema=alpha70, date=20140109:
    Record 1: BTRID=10849355, bucket=10849355%16=11=B, row key = \x0B\x9A\xA35\xF7\xA5\xAF\x98\x1Fo20140109\x00\x00\x00\x00\x00\xA5\x8CK
    Record 2: BTRID=10849360, bucket=10849360%16=0, row key = \x00\x9A\xA35\xF7\xA5\xAF\x98\x1Fo20140109\x00\x00\x00\x00\x00\xA5\x8CP
    Record 3: BTRID=10849363, bucket=10849363%16=3, row key = \x03\x9A\xA35\xF7\xA5\xAF\x98\x1Fo20140109\x00\x00\x00\x00\x00\xA5\x8CS
    – GHK Feb 25 '14 at 06:52
  • 1
    Please notice that the splits we configured are not hexadecimal, they're characters, that way, any row starting with a "b" (\x62 HEX) will go to the "b" region, or any row below "1" (\x31 HEX) will be sent to the "0" region, that's why everything is going to the first region. I'll recommend you to update your script so the first byte is an ASCII character representing the HEX value [0-9a-f] (or to update your splits to use your one byte HEX. Take a look at this: https://gist.github.com/rmruano/9204707 – Rubén Moraleda Feb 25 '14 at 08:01
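The fix suggested in this comment, making the first key byte an ASCII hex character so it matches the character splits, could look like this. It is a sketch only; the BTRID name is taken from the comments above, and the helper is hypothetical:

```java
public class BucketPrefixExample {
    // Maps an id to one of 16 buckets and returns the ASCII hex character
    // ('0'-'9', 'a'-'f') to use as the first row key byte, so that rows land
    // in the region whose split boundary uses the same character.
    static byte bucketPrefix(long btrid) {
        int bucket = (int) (btrid % 16);
        return (byte) Character.forDigit(bucket, 16); // e.g. bucket 11 -> 'b' (\x62)
    }

    public static void main(String[] args) {
        System.out.println((char) bucketPrefix(10849355L)); // 10849355 % 16 = 11 -> 'b'
        System.out.println((char) bucketPrefix(10849360L)); // 10849360 % 16 = 0  -> '0'
    }
}
```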
  • I have created the HBase table as you suggested, with hex splits like {SPLITS => ['\x01', '\x02','\x03', '\x04','\x05', '\x06','\x07', '\x08','\x09', '\x0A','\x0B', '\x0C','\x0D', '\x0E','\x0F']}, but when I check the table I can see regions with values like start key=\x5Cx01 end key=\x5Cx02, start key=\x5Cx02 end key=\x5Cx03, and so on. Here the ASCII value \x5C is \, so \x02 is converted to \x5Cx02, and all the rows again go to one region (the first region) because every row's first byte value is lower than \x5C. Am I doing anything wrong here? – GHK Feb 25 '14 at 11:24
  • I don't get any such conversion; could it be an issue with your version? Anyway, I'd just stick to using ASCII chars as splits, it's a lot easier to handle. I would just take an MD5 of your BTRID and extract the first char. – Rubén Moraleda Feb 25 '14 at 11:47
  • The HBase version we are using is HBase 0.94.6-cdh4.4.0. Is this version different from yours? Meanwhile, I will also try converting the first byte of my row key into ASCII and see the final outcome; I will update here with my final observations. If you have any suggestions, please share them here. – GHK Feb 25 '14 at 13:09
  • By converting the first byte value to ASCII, I can see the data moving to different regions. I will check whether all our existing use cases still work with this conversion. – GHK Feb 25 '14 at 13:58
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/48376/discussion-between-ruben-moraleda-and-ghk) – Rubén Moraleda Feb 25 '14 at 14:41