
I'm trying to create a generic method in Java for querying hbase.

I currently have one written that takes three arguments:

  • A Range (to scan the table)
  • A Column (to be returned)
  • A Condition (e.g. browser==Chrome)

So a statement (if written in a SQL-like language) might look like:

SELECT OS FROM TABLE WHERE BROWSER==CHROME IN RANGE (5 WEEKS AGO -> 2 WEEKS AGO)

Now, I know I'm not using HBase properly (using common column queries for rowkey etc.) but for the sake of experimentation I'd like to try it, to help me learn.

So the first thing I do is set a range on the Scan (5 weeks ago to 2 weeks ago). Since the rowkey is the timestamp, this is very efficient.
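For illustration, the bounds of that range could be computed as in the sketch below. It assumes the rowkey is an epoch-millisecond timestamp written as a zero-padded 13-digit string, which is only a guess at the key encoding; `ScanRangeSketch`, `rowKeyFor` and `rangeFor` are hypothetical names. The two strings would be passed to the Scan as start row and stop row (the stop row is exclusive).

```java
import java.time.Duration;
import java.time.Instant;

class ScanRangeSketch {
    /** Zero-padded epoch millis, so lexicographic order matches numeric order. */
    static String rowKeyFor(Instant t) {
        return String.format("%013d", t.toEpochMilli());
    }

    /** Returns {startRow, stopRow} covering [now - 5 weeks, now - 2 weeks). */
    static String[] rangeFor(Instant now) {
        Instant start = now.minus(Duration.ofDays(35)); // 5 weeks ago
        Instant stop = now.minus(Duration.ofDays(14));  // 2 weeks ago
        return new String[] { rowKeyFor(start), rowKeyFor(stop) };
    }

    public static void main(String[] args) {
        String[] range = rangeFor(Instant.now());
        System.out.println(range[0] + " -> " + range[1]);
    }
}
```

The zero padding matters because HBase compares rowkeys byte by byte, so unpadded numbers of different lengths would not sort in time order.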

Then I set a SingleColumnValueFilter (browser = Chrome). Once the range has been applied, this is pretty fast.

Then I store all the rowkeys (from the scan) into an array.

For each rowkey (in the array) I perform a GET operation to get the corresponding OS.

I have tried using MultiGet, which sped up the process a lot.

I then tried using normal GET requests, each spawning a new thread, all running concurrently, which halved the query time! But still not fast enough.

I have considered limiting the number of threads that share a single connection to the database, e.g. 100 threads per connection.

Given my circumstances, what is the most efficient way to perform these GETs, or am I totally approaching it incorrectly?

Any help is hugely appreciated.

EDIT (Here is my threaded GET attempt)

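// Thread-safe list: every worker thread appends its result here.
List<String> newresults = Collections.synchronizedList(new ArrayList<String>());

for (String rowkey : result) {
    spawnGetThread(rowkey, colname);
}

// Parameters are final so the anonymous Runnable can capture them (required before Java 8).
public void spawnGetThread(final String rk, final String cn) {
    new Thread(new Runnable() {
        public void run() {
            String rt = "";
            Get get = new Get(Bytes.toBytes(rk));
            // addColumn expects byte[] for both the family and the qualifier
            get.addColumn(COL_FAM, Bytes.toBytes(cn));
            try {
                Result getResult = tb.get(get);
                rt = Bytes.toString(getResult.value());
            } catch (IOException e) {
                e.printStackTrace(); // don't silently swallow a failed GET
            }
            newresults.add(rt);
        }
    }).start();
}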
Greg Peckory

1 Answer


Given my circumstances, what is the most efficient way to perform these GETs, or am I totally approaching it incorrectly?

I would suggest the approach below.

A Get is a good fit when you know up front which rowkeys you need to access.

In that case you can use a method like the one below, which returns an array of Result.

    /**
     * Method getDetailRecords.
     *
     * @param listOfRowKeys List<String>
     * @return Result[]
     * @throws IOException
     */
    private Result[] getDetailRecords(final List<String> listOfRowKeys) throws IOException {
        final HTableInterface table = HBaseConnection.getHTable(TBL_DETAIL);
        final List<Get> listOFGets = new ArrayList<Get>();
        Result[] results = null;
        try {
            // prepare a batch of Gets, one per row key
            for (final String rowkey : listOfRowKeys) {
                // System.err.println("get 'yourtablename', '" + saltIndexPrefix + rowkey + "'");
                final Get get = new Get(Bytes.toBytes(saltedRowKey(rowkey)));
                get.addColumn(COLUMN_FAMILY, Bytes.toBytes(yourcolumnname));
                listOFGets.add(get);
            }
            // a single call fetches the whole batch
            results = table.get(listOFGets);
        } finally {
            table.close();
        }
        return results;
    }
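
If one large batch is still too slow, one option worth measuring is to split the row keys into smaller batches and fetch them with a bounded thread pool instead of one thread per rowkey. The following is only a sketch of that idea, not something taken from the HBase API: `BatchedGetSketch`, `partition` and `fetchAll` are hypothetical names, and `fetchBatch` is a stand-in for a method like `getDetailRecords` above. Since `HTable` instances are not thread-safe, each task should obtain its own table handle from a shared connection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class BatchedGetSketch {
    /** Splits keys into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> keys, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    /**
     * Applies fetchBatch to every batch using at most `threads` workers and
     * returns the values in the same order as the input keys.
     * fetchBatch is a hypothetical stand-in for a batched HBase get.
     */
    static List<String> fetchAll(List<String> keys, int batchSize, int threads,
                                 Function<List<String>, List<String>> fetchBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (List<String> batch : partition(keys, batchSize)) {
                tasks.add(() -> fetchBatch.apply(batch));
            }
            List<String> results = new ArrayList<>();
            // invokeAll waits for all tasks and returns futures in task order.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("k1", "k2", "k3", "k4", "k5");
        // Fake lookup standing in for the real HBase call.
        List<String> values = fetchAll(keys, 2, 3, batch -> {
            List<String> out = new ArrayList<>();
            for (String key : batch) {
                out.add("os-" + key);
            }
            return out;
        });
        System.out.println(values);
    }
}
```

The batch size and thread count are tuning knobs; sensible values depend on the cluster and have to be found by measurement.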

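Additional note: rowkey-based filters are generally faster than column value filters, because a column value filter has to examine every row in the scanned range (a full table scan when no start/stop row is set).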

I would suggest reading HBase: The Definitive Guide, chapter "Client API: Advanced Features".

Ram Ghadiyaram
  • Please see my other answer: http://stackoverflow.com/questions/37899344/querying-hbase-efficiently/37908912#37908912 If you are using a scan, `FuzzyRowFilter` is fast, but you need to make sure that the browser (e.g. Chrome) is part of the rowkey. For example, you could model the browser type as an enum and add its ordinal to the rowkey, so that the fuzzy row filter can jump between rows and efficiently find the rows you need. – Ram Ghadiyaram Aug 10 '16 at 18:29
  • Unfortunately my column values won't be of the same length, so `FuzzyRowFilter` is out of the question. Unless maybe I redefine the data to constant lengths with a map to the real values... With regards to multiple GET requests however, spawning new threads sped up the process significantly; it seems the async way is faster than the above way. Are there any problems with this? Can it be improved? – Greg Peckory Aug 11 '16 at 08:34
  • By "async way", do you mean AsyncHBase? If so, that is mostly used with Flume. If you have done something with AsyncHBase, can you share a code snippet showing how you are doing it? It sounds a bit strange to me. – Ram Ghadiyaram Aug 11 '16 at 09:37
  • Sorry, what I meant was my own custom threaded version of GET. See my EDIT – Greg Peckory Aug 11 '16 at 09:54
  • Also, it seems if I have `timestamp` at the beginning of my `rowkey`, then `fuzzyRowFilter` will not work efficiently, i.e. it won't perform any jumps. This is because `timestamp` is unique. But I need `timestamp` at the beginning of my `rowkey` for Range scans. Do you see my dilemma? – Greg Peckory Aug 11 '16 at 10:47
  • OK, now I understand that it's the multithreaded approach. Have you tested the code above with the same rowkeys used in your snippet? If so, didn't you see a quicker response? – Ram Ghadiyaram Aug 11 '16 at 10:51
  • Yes, I tested and it works faster than `MultiGet`, however it is still quite slow. I am considering using `FuzzyRowFilter`, but my problem is that `timestamp` is the prefix for the `rowkey`, and it is unique. Is there a way to work around this? Btw, thanks for all the help! – Greg Peckory Aug 11 '16 at 11:02
  • The important thing is that the timestamp can't be the prefix in your case. The prefix can be a salt index such as 0-7. Timestamps vary, and if you put the timestamp first, data hotspotting will happen. – Ram Ghadiyaram Aug 11 '16 at 11:32
  • To avoid hotspotting you can create a table with pre-splits 0-7 and make sure that the prefix of each rowkey is a number between 0 and 7. This ensures that data is spread uniformly across the region servers rather than loaded onto one particular region server (which is the hotspot). – Ram Ghadiyaram Aug 11 '16 at 11:35
  • I think a list of Gets is the only solution you have. – Ram Ghadiyaram Aug 11 '16 at 11:36
  • It seems so. Thanks a lot for all the help! – Greg Peckory Aug 11 '16 at 12:45
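
The salting idea from the comments above can be sketched as follows. This is only an illustration under assumptions: the hash-based bucket choice and the `-` separator are made up for the example and are not necessarily what the `saltedRowKey` helper in the answer does, and `SaltedKeySketch` and `saltedRanges` are hypothetical names. With a salt in front of the timestamp, a time-range query becomes one scan per bucket, which is how a salted key can still support range scans.

```java
import java.util.ArrayList;
import java.util.List;

class SaltedKeySketch {
    static final int BUCKETS = 8; // salt prefixes 0..7, matching the pre-splits

    /** Prefixes the key with a bucket derived from its hash: "<bucket>-<rowKey>". */
    static String saltedRowKey(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return bucket + "-" + rowKey;
    }

    /** One {startRow, stopRow} pair per bucket; a range query runs all of them. */
    static List<String[]> saltedRanges(String startRow, String stopRow) {
        List<String[]> ranges = new ArrayList<>();
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
            ranges.add(new String[] { bucket + "-" + startRow, bucket + "-" + stopRow });
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(saltedRowKey("1470000000000"));
        for (String[] range : saltedRanges("1466976000000", "1468790400000")) {
            System.out.println(range[0] + " -> " + range[1]);
        }
    }
}
```

Because the bucket is computed from the key itself, a Get for a known rowkey can recompute the same prefix, while the eight range scans can run in parallel and have their results merged on the client.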