Querying Hbase efficiently

Question

I'm using Java as a client for querying HBase.

My HBase table is set up like this:

ROWKEY     |     HOST     |     EVENT
-----------|--------------|----------
21_1465435 | host.hst.com |  clicked
22_1463456 | hlo.wrld.com |  dragged
    .             .             .
    .             .             .
    .             .             .

The first thing I need to do is get a list of all ROWKEYs which have host.hst.com associated with them.

I can create a scanner at Column host and for each row value with column value = host.hst.com I will add the corresponding ROWKEY to the list. Seems pretty efficient. O(n) for getting all rows.

Now is the hard part. For each ROWKEY in the list, I need to get the corresponding EVENT.

If I use a normal GET command to get the cell at (ROWKEY, EVENT), I believe a scanner is created at EVENT which takes O(n) time to find the correct cell and return the value. Which is pretty bad time complexity for each individual ROWKEY. Combining the two gives us O(n^2).

Is there a more efficient way of going about this?

Thanks a lot for any help in advance!

score 3 · Accepted Answer · edited May 23 '17 at 12:08

What is your n here?? With the RowKey in hand - I presume you mean the HBase rowkey - not some handcrafted one?? - that is fast/easy for HBase. Consider that to be O(1).

If instead the ROWKEY is an actual column you created .. then there is your issue. Use the HBase provided rowkey instead.

So let's move on, assuming you either (a) already properly use the HBase-provided rowkey or (b) have fixed your structure to do so.

In that case you can simply create a separate get for each (rowkey, EVENT) value as follows:

Perform a `get` with the given `rowkey`. 
In your result then filter out EVENT in <yourEventValues for that rowkey>

So you will end up fetching all recent (latest timestamp) entries for the given rowkey. This is presumably small compared to 'n' ?? Then the filtering is a fast operation on one column.

You can also speed this up by doing a batched multiget. The savings come from fewer round trips between the client and the region servers (the HBase master is not on the read path) and from less per-request overhead.
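As a rough sketch of the batching idea: split the list of rowkeys into fixed-size chunks and turn each chunk into one multiget, i.e. one `Table.get(List<Get>)` call. The batch size, the class name `RowKeyBatches` and the helper `partition` are my own assumptions, not anything HBase prescribes, and the HBase call itself is left out so the snippet runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups row keys so that each group can become one multiget.
final class RowKeyBatches {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // The first two keys come from the question; the others are made up.
        List<String> rowKeys = Arrays.asList(
                "21_1465435", "22_1463456", "23_0000001", "24_0000002", "25_0000003");
        for (List<String> batch : partition(rowKeys, 2)) {
            // In real code: build one Get per key and pass the list to table.get(...).
            System.out.println(batch.size() + " gets in this batch: " + batch);
        }
    }
}
```

A good batch size depends on row size and cluster setup, so treat any concrete number as something to measure rather than a recommendation.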

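Update: Thanks to the OP, I understand the situation more clearly. I am suggesting to simply use the host as the leading part of the rowkey (for example the host, then a separator such as "|", then the existing key). Then you can do a range scan and obtain all the entries for a host from a single Scan.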
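To make that key layout concrete, here is a minimal sketch in plain Java. The `|` separator, the class name `HostRowKey` and its methods are assumptions for illustration; any separator that cannot occur in a host name would do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for a composite row key of the form "host|originalKey".
final class HostRowKey {
    private static final char SEPARATOR = '|'; // assumed separator, not an HBase convention

    /** Builds the composite key so that all rows of one host sort next to each other. */
    static byte[] compose(String host, String originalKey) {
        if (host.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("host must not contain the separator");
        }
        return (host + SEPARATOR + originalKey).getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the host back out of a composite key, so it need not be stored as a column. */
    static String hostOf(byte[] rowKey) {
        String key = new String(rowKey, StandardCharsets.UTF_8);
        int split = key.indexOf(SEPARATOR);
        if (split < 0) {
            throw new IllegalArgumentException("not a composite key: " + key);
        }
        return key.substring(0, split);
    }

    public static void main(String[] args) {
        byte[] key = compose("host.hst.com", "21_1465435");
        System.out.println(new String(key, StandardCharsets.UTF_8)); // host.hst.com|21_1465435
        System.out.println(hostOf(key));                             // host.hst.com
    }
}
```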

Another update

HBase supports range scans based on prefixes of the rowkey. So if you have foobarRow1, foobarRow2, etc., then you can do a range scan on (foobarRow, foobarRowz) and it will find all of the rows whose rowkeys start with foobarRow followed by any alphanumeric characters.
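The `foobarRowz` upper bound works for alphanumeric suffixes. A more general way is to derive the stop row from the prefix by incrementing its last byte. Here is a minimal sketch in plain Java; the class and method names are made up for illustration. The prefix would be the start row (inclusive) and the computed value the stop row (exclusive) of a `Scan`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper that computes the exclusive upper bound for a prefix scan.
final class PrefixScanRange {

    /**
     * Returns the smallest key that sorts after every key starting with prefix.
     * An empty array means "no upper bound" (the prefix was empty or all 0xFF bytes).
     */
    static byte[] stopRowFor(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // bump the last remaining byte by one
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "foobarRow".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowFor(start);
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // foobarRox
    }
}
```

As far as I know the client's `Scan` class also offers `setRowPrefixFilter(byte[])`, which sets up the same kind of range for you, so check your client version before hand-rolling this.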

Take a look at this HBase (Easy): How to Perform Range Prefix Scan in hbase shell

Here is some illustrative code:

SingleColumnValueFilter filter = new SingleColumnValueFilter(
   Bytes.toBytes("columnfamily"),
   Bytes.toBytes("storenumber"),
   CompareFilter.CompareOp.NOT_EQUAL,
   Bytes.toBytes(15)
);
filter.setFilterIfMissing(true);
Scan scan = new Scan(
   Bytes.toBytes("20110103-1"),
   Bytes.toBytes("20110105-1")
);
scan.setFilter(filter);

Notice that the 20110103-1 and 20110105-1 provide a range of rowkeys to search.

answered Jun 18 '16 at 20:29

WestCoastProjects

Thanks a lot for the answer. I wrote a method which scans through the column `HOST` and returns a String list of all the corresponding `ROWKEYs` with `host=x`. This takes 3 seconds. Then I wrote a method which loops through all these `ROWKEYs` and `GETs` all of their `EVENTs`. This takes around 120 seconds. How could this be `O(1)` for each `GET`? – Greg Peckory Jun 19 '16 at 09:53
By `n` I mean the number of rows. Also I am using the default Rowkeys, not my own custom column – Greg Peckory Jun 19 '16 at 09:55
@GregPeckory OK now I "get it". So I updated my answer to suggest: use a concatenated rowkey that consists of the host followed by the original ROWKEY. In that case you can do a range scan on the host prefix that will return **all** of the entries for that host in one scan. – WestCoastProjects Jun 19 '16 at 15:19
Sounds perfect! So it is indeed the case that returning a column for n rowkeys takes n gets? – Greg Peckory Jun 20 '16 at 08:17
NO! If the n rowkeys were contiguous you can use a single rowscan – WestCoastProjects Jun 20 '16 at 08:29
Apologies if I'm a bit slow, very new to hbase. Could you explain how I might use a scan or multiget using my table structure. I really appreciate the help – Greg Peckory Jun 20 '16 at 08:36
Hi @GregPeckory . You're probably just missing one piece of info: i have updated the answer – WestCoastProjects Jun 20 '16 at 08:46
Thanks. That makes sense for using `host ` as a prefix for the rowkey. Which I will probably implement. Suppose I kept `host` as a column. Then it would be n `GETs ` to achieve this? – Greg Peckory Jun 20 '16 at 08:58
@GregPeckory You probably do not want to have the `host` in *both* the rowkey and its own column: it would be an unnecessary usage of space. The calling program can parse out the host from the rowkey. Now if instead you were thinking to simply not include the `host` in the rowkey at all - then we're back to square one. In that case - given you are searching for disparate *non-contiguous* rows then - yes - each `host` would require its own get. – WestCoastProjects Jun 20 '16 at 09:03

Ram Ghadiyaram · Answer 2 · 2016-06-20T03:43:29.717

First of all, your rowkey design should fit your access pattern, because it determines how you can query the table.

1) Get is good if you know upfront which rowkeys you need to access.

In that case you can use a method like the one below; it returns an array of Result.

/**
     * Method getDetailRecords.
     * 
     * @param listOfRowKeys List<String>
     * @return Result[]
     * @throws IOException
     */
    private Result[] getDetailRecords(final List<String> listOfRowKeys) throws IOException {
        final HTableInterface table = HBaseConnection.getHTable(TBL_DETAIL);
        final List<Get> listOFGets = new ArrayList<Get>();
        Result[] results = null;
        try {
            for (final String rowkey : listOfRowKeys) { // prepare a batch of Gets, one per row key
                // System.err.println("get 'yourtablename', '" + saltIndexPrefix + rowkey + "'");
                final Get get = new Get(Bytes.toBytes(saltedRowKey(rowkey)));
                get.addColumn(COLUMN_FAMILY, Bytes.toBytes(yourcolumnname));
                listOFGets.add(get);
            }
            results = table.get(listOFGets);

        } finally {
            table.close();
        }
        return results;
    }

2) In my experience with HBase, Scan performance is rather low if the rowkey design does not fit the access pattern. If you opt for a scan in the scenario you describe, I recommend FuzzyRowFilter (see HBase: The Definitive Guide). It has been really useful in our case; we have used it from bulk clients such as MapReduce jobs as well as from standalone HBase clients.

This filter acts on row keys, but in a fuzzy manner. It needs a list of row keys that should be returned, plus an accompanying byte[] array that signifies the importance of each byte in the row key. The constructor is as such:

FuzzyRowFilter(List<Pair<byte[], byte[]>> fuzzyKeysData)

The fuzzyKeysData specifies the mentioned significance of a row key byte, by taking one of two values:

0: indicates that the byte at the same position in the row key must match as-is.
1: means that the corresponding row key byte does not matter and is always accepted.

Example: Partial Row Key Matching

A possible example is matching partial keys, but not from left to right, rather somewhere inside a compound key. Assuming a row key format of <userId>_<actionId>_<year>_<month>, with fixed length parts, where <userId> is 4, <actionId> is 2, <year> is 4, and <month> is 2 bytes long. The application now requests all users that performed certain action (encoded as 99) in January of any year. Then the pair for row key and fuzzy data would be the following:

row key = "????_99_????_01", where the "?" is an arbitrary character, since it is ignored.
fuzzy data = "\x01\x01\x01\x01\x00\x00\x00\x00\x01\x01\x01\x01\x00\x00\x00"

In other words, the fuzzy data array instructs the filter to find all row keys matching "????_99_????_01", where the "?" will accept any character.
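To illustrate the 0/1 semantics of the fuzzy data, here is a small stand-alone sketch of the matching rule in plain Java. It is not the HBase implementation: the class `FuzzyMatchSketch` and its `matches` method are made up, and it simply treats row keys shorter than the pattern as non-matching. It uses a 15-byte pattern of the form `????_99_????_01` together with the 15-byte fuzzy data from the example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative re-implementation of the matching rule only, not HBase's FuzzyRowFilter.
final class FuzzyMatchSketch {

    /** mask[i] == 0: byte i must equal fuzzyKey[i]; mask[i] == 1: any byte is accepted. */
    static boolean matches(byte[] rowKey, byte[] fuzzyKey, byte[] mask) {
        if (fuzzyKey.length != mask.length) {
            throw new IllegalArgumentException("key and mask must have the same length");
        }
        if (rowKey.length < fuzzyKey.length) {
            return false; // simplification chosen for this sketch
        }
        for (int i = 0; i < fuzzyKey.length; i++) {
            if (mask[i] == 0 && rowKey[i] != fuzzyKey[i]) {
                return false;
            }
        }
        return true;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] pattern = bytes("????_99_????_01");
        byte[] mask = { 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0 };
        // The row keys below are hypothetical sample data.
        System.out.println(matches(bytes("0042_99_2016_01"), pattern, mask)); // true
        System.out.println(matches(bytes("0042_98_2016_01"), pattern, mask)); // false: other action
        System.out.println(matches(bytes("0042_99_2016_02"), pattern, mask)); // false: other month
    }
}
```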

An advantage of this filter is that it can likely compute the next matching row key when it comes to an end of a matching one. It implements the getNextCellHint() method to help the servers in fast-forwarding to the next range of rows that might match. This speeds up scanning, especially when the skipped ranges are quite large. Example 4-12 uses the filter to grab specific rows from a test data set.

Example filtering by column prefix

List<Pair<byte[], byte[]>> keys = new ArrayList<Pair<byte[], byte[]>>();
keys.add(new Pair<byte[], byte[]>(
  Bytes.toBytes("row-?5"), new byte[] { 0, 0, 0, 0, 1, 0 }));
Filter filter = new FuzzyRowFilter(keys);

Scan scan = new Scan()
  .addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5"))
  .setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  System.out.println(result);
}
scanner.close();

The example code also adds a filtering column to the scan, just to keep the output short:

Adding rows to table...
Results of scan:

keyvalues={row-05/colfam1:col-01/1/Put/vlen=9/seqid=0,
           row-05/colfam1:col-02/2/Put/vlen=9/seqid=0,
           ...
           row-05/colfam1:col-09/9/Put/vlen=9/seqid=0,
           row-05/colfam1:col-10/10/Put/vlen=9/seqid=0}
keyvalues={row-15/colfam1:col-01/1/Put/vlen=9/seqid=0,
           row-15/colfam1:col-02/2/Put/vlen=9/seqid=0,
           ...
           row-15/colfam1:col-09/9/Put/vlen=9/seqid=0,
           row-15/colfam1:col-10/10/Put/vlen=9/seqid=0}

The test code wiring adds 20 rows to the table, named row-01 to row-20. We want to retrieve all the rows that match the pattern row-?5, in other words all rows that end in the number 5. The output above confirms the correct result.

Thanks for the detailed answer. In the case of 1), you say that GET is good if I know the ROWKEYs, which I do. Would you know why running GET on around 70000 rows takes 2 minutes, while filtering ROWKEYs based on column values takes 3 seconds? I figured this would be correct, since HBase is column-oriented. But everyone is saying GET is very efficient and `O(1)` – Greg Peckory Jun 19 '16 at 16:52
That surprises me. If you pass the 7000 rowkeys as a batch to the method above, it should be faster. If you search by column value, HBase does a full table scan to find that value, which should be slow (unless, incidentally, all those rows are in the same region). – Ram Ghadiyaram Jun 19 '16 at 16:58
One rule of thumb: rowkey-based access (row filter) should always be faster than column-value-based access (column filter). – Ram Ghadiyaram Jun 19 '16 at 17:00
