3

My key has three components: num, type, name

The 'type' takes only two values, A and B, while num can take more values, e.g. 0, 1, 2, ..., 30.

I have to fetch data with respect to num and type i.e. fetch all rows which have keys with the specified num and type.

I can either store data in the form: 1. num|type|name or 2. type|num|name

Considering how HBase scans through data if I use partial key scanning, which is the best strategy to store data?

This is how I will set up my partial key scan:

For 1.

scan.setStartRow(Bytes.toBytes(num));
scan.setStopRow(Bytes.toBytes(num + 1));

For 2.

scan.setStartRow(Bytes.toBytes(type + "|" + num));
scan.setStopRow(Bytes.toBytes(type + "|" + (num + 1)));
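One thing worth checking with bounds like these is what happens when num is written as a variable-length string. The sketch below is not HBase code: it stands in a TreeMap for the table (half-open range, start inclusive and stop exclusive, like a scan), assumes ASCII-only keys so that String order matches byte order, and uses made-up rows. The names PrefixScanSketch, scan and stopAfterPrefix are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PrefixScanSketch {
    // Stand-in for a table: sorted keys, scanned as [startRow, stopRow).
    // Assumes ASCII-only keys, so String order equals byte order.
    static final TreeMap<String, String> TABLE = new TreeMap<>();
    static {
        for (String key : new String[] {"A|2|x", "A|20|y", "A|3|z", "B|2|w"}) {
            TABLE.put(key, "value");
        }
    }

    // Keys in the half-open range [startRow, stopRow).
    static List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(TABLE.subMap(startRow, stopRow).keySet());
    }

    // Smallest string that sorts after every key starting with prefix:
    // bump the last character by one (fine for ASCII prefixes).
    static String stopAfterPrefix(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    public static void main(String[] args) {
        // Bounds without the trailing delimiter also pick up num 20.
        System.out.println(scan("A|2", "A|3"));                    // [A|20|y, A|2|x]
        // Including the delimiter in the prefix keeps the scan to num 2.
        String prefix = "A|2|";
        System.out.println(scan(prefix, stopAfterPrefix(prefix))); // [A|2|x]
    }
}
```

If num is stored as a fixed-width value instead (zero-padded text or a 4-byte int), the original bounds would not have this problem.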
Jørgen R
Monis Iqbal

2 Answers

4

First, I would recommend against using pipe as a delimiter. It is ASCII 124, which sorts after all letters and digits, so the sort order will not be what you expect (unless you left-pad everything, but that makes for overly large keys). For HBase rowkey delimiters, use something that sorts lexicographically before all of your valid key characters to preserve correct sorting. Tab, at ASCII 9, works well.
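To make the sorting point concrete, here is a small sketch that sorts the same made-up keys with each delimiter. It uses a TreeSet of Strings rather than HBase, which matches byte order only under the assumption of ASCII-only keys; DelimiterOrderSketch and sortedKeys are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class DelimiterOrderSketch {
    // Builds num + delimiter + type keys and returns them in sorted order.
    static List<String> sortedKeys(char delimiter) {
        TreeSet<String> keys = new TreeSet<>();
        for (String num : new String[] {"2", "20", "3"}) {
            for (String type : new String[] {"A", "B"}) {
                keys.add(num + delimiter + type);
            }
        }
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        // Pipe (124) sorts after digits, so "20|..." lands before "2|...".
        System.out.println(sortedKeys('|'));  // [20|A, 20|B, 2|A, 2|B, 3|A, 3|B]
        // Tab (9) sorts before digits, so "2<tab>..." stays ahead of "20<tab>...".
        System.out.println(sortedKeys('\t').toString().replace('\t', '_'));
        // [2_A, 2_B, 20_A, 20_B, 3_A, 3_B]
    }
}
```

Either way the order is lexicographic, not numeric ("20" still sorts before "3"), which is where left-padding would come in.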

Considering that type has only two valid values, and assuming a random distribution, I would go with the num, type order. This lets you select on num alone if you need to in the future. Selecting on num alone with the reverse order, type then num, takes two fetches: once for type 'A' and again for type 'B'. Not the most efficient.

If you will rarely select on num alone, then it does make sense to go with type then num, as that is the most selective at the row level, if inflexible.
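The one-fetch versus two-fetch difference can be sketched outside HBase with a sorted set and prefix ranges. This assumes a tab delimiter, ASCII-only keys, and made-up rows; KeyOrderSketch, build and prefixScan are hypothetical names, and each prefixScan call stands in for one partial key scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class KeyOrderSketch {
    static final char SEP = '\t';

    // Builds keys for nums 1..2 and types A, B in either component order.
    static TreeSet<String> build(boolean numFirst) {
        TreeSet<String> table = new TreeSet<>();
        for (String num : new String[] {"1", "2"}) {
            for (String type : new String[] {"A", "B"}) {
                String head = numFirst ? num + SEP + type : type + SEP + num;
                table.add(head + SEP + "name");
            }
        }
        return table;
    }

    // One range read over [prefix, prefix with its last character bumped).
    static List<String> prefixScan(TreeSet<String> table, String prefix) {
        int last = prefix.length() - 1;
        String stop = prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
        return new ArrayList<>(table.subSet(prefix, stop));
    }

    public static void main(String[] args) {
        // num first: a single range covers every row with num 2.
        System.out.println(prefixScan(build(true), "2" + SEP).size());  // 2

        // type first: num 2 needs one range per type.
        TreeSet<String> typeFirst = build(false);
        int found = prefixScan(typeFirst, "A" + SEP + "2" + SEP).size()
                  + prefixScan(typeFirst, "B" + SEP + "2" + SEP).size();
        System.out.println(found);                                       // 2
    }
}
```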

Really you should try them both out and see what works best with your data.

cftarnas
  • Thanks for the detailed reply. A few questions: The delimiter will be the same for all the rows, so I'm not sure I understand how it will affect the sort order. Could it affect data size, I suppose? – Monis Iqbal Aug 16 '11 at 10:20
  • Suppose we scan by num, e.g. a specific num '2'. If data is persisted as num|type, it is possible that all the 2's will be in a single file. In that case, will it reduce parallelism? – Monis Iqbal Aug 16 '11 at 10:20
  • Conversely, if it was persisted as type|num, then all A|2's would be in one file and all B|2's in another. Will this increase parallelism while scanning for 2's? – Monis Iqbal Aug 16 '11 at 10:20
  • I should back up a bit on the delimiter - if you don't care about overall sorting and just want data grouped together then using any delimiter is fine. If you care about how it is sorted then you need to be careful in your choices. – cftarnas Aug 16 '11 at 19:28
  • With num|type as the order there is a good chance that both rows with num '2' will be in the same region/node. With scanners that is good - scanners operate on a single region at a time. If you are more worried about parallel readers from different clients then you would want more distribution and having the type lead might be good. – cftarnas Aug 16 '11 at 19:31
  • And - having the type lead just means you need to do two reads to check each type: look for A|2 then B|2. I'll correct the answer to reflect that. – cftarnas Aug 16 '11 at 19:32
1

There are a couple of approaches you can take.

1) Choose whichever layout you will be scanning more frequently. Then, for the less frequent scan type, do a full scan (or restrict it to a range if you can) and use a row filter that filters out everything but the items you want. Regarding filters: http://hbase.apache.org/apidocs/index.html

2) You can duplicate your data by storing it twice (once under each row key layout). This is going to slow writes, but it helps a lot with reads if you scan on both. Of course, disk usage is also doubled.

3) You can construct an index with the alternative row names to point to the relevant rows.
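As a rough illustration of option 3, here is an in-memory sketch with two sorted maps standing in for a data table and an index table. It is not HBase code: the class and method names are hypothetical, the tab delimiter and ASCII-only keys are assumptions, and in a real cluster the two puts would be separate writes that need to be kept in sync.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SecondaryIndexSketch {
    static final char SEP = '\t';

    // Primary layout: num, type, name -> value.
    final TreeMap<String, String> data = new TreeMap<>();
    // Index layout: type, num, name -> primary row key.
    final TreeMap<String, String> index = new TreeMap<>();

    // Writes the row and its index entry (two writes per logical put).
    void put(String num, String type, String name, String value) {
        String rowKey = num + SEP + type + SEP + name;
        data.put(rowKey, value);
        index.put(type + SEP + num + SEP + name, rowKey);
    }

    // Range-reads the index by type prefix, then looks up each data row.
    List<String> valuesForType(String type) {
        String start = type + SEP;
        String stop = type + (char) (SEP + 1);
        List<String> values = new ArrayList<>();
        for (String rowKey : index.subMap(start, stop).values()) {
            values.add(data.get(rowKey));
        }
        return values;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch table = new SecondaryIndexSketch();
        table.put("1", "A", "x", "v1");
        table.put("2", "B", "y", "v2");
        table.put("2", "A", "z", "v3");
        System.out.println(table.valuesForType("A"));  // [v1, v3]
    }
}
```

Option 2 looks much the same, except the second map would hold the full value rather than a pointer back to the primary row.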

What approach you take will depend heavily on the access patterns of your data and read/write ratio.

juhanic