I'm trying to create a generic method in Java for querying HBase.
I currently have one written which takes three arguments:
- A Range (to scan the table)
- A Column (to be returned)
- A Condition (i.e. browser==Chrome)
So a statement (if written in an SQL-like language) may look like:
SELECT OS FROM TABLE WHERE BROWSER==CHROME IN RANGE (5 WEEKS AGO -> 2 WEEKS AGO)
Now, I know I'm not using HBase properly (using common column queries for rowkey etc.), but for the sake of experimentation I'd like to try it, to help me learn.
So the first thing I do is set a Range on the Scan (5 weeks ago to 2 weeks ago). Since the rowkey is the timestamp, this is very efficient.
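For illustration, the range bounds could be computed roughly like this. It is only a sketch: `ScanRange` and `weeksAgo` are hypothetical names, and it assumes the rowkey is an epoch-millisecond timestamp. How the bounds get encoded into rowkey bytes depends on the table, so that part is left out.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not part of HBase): holds the time window for the scan.
final class ScanRange {
    final long startMillis; // older bound, e.g. 5 weeks ago
    final long stopMillis;  // newer bound, e.g. 2 weeks ago

    ScanRange(long startMillis, long stopMillis) {
        if (startMillis > stopMillis) {
            throw new IllegalArgumentException("start must not be after stop");
        }
        this.startMillis = startMillis;
        this.stopMillis = stopMillis;
    }

    // Builds a window from `fromWeeksAgo` (older) to `toWeeksAgo` (newer),
    // measured back from `now`. Assumes rowkeys are epoch-millisecond timestamps.
    static ScanRange weeksAgo(Instant now, int fromWeeksAgo, int toWeeksAgo) {
        long start = now.minus(Duration.ofDays(7L * fromWeeksAgo)).toEpochMilli();
        long stop = now.minus(Duration.ofDays(7L * toWeeksAgo)).toEpochMilli();
        return new ScanRange(start, stop);
    }
}
```

With this, `ScanRange.weeksAgo(Instant.now(), 5, 2)` would give the two bounds that are then turned into the scan's start and stop rows.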
Then I set a SingleColumnValueFilter (browser = Chrome). After the range filter, this is pretty fast.
Then I store all the rowkeys (from the scan) into an array.
For each rowkey (in the array) I perform a GET operation to get the corresponding OS.
I have tried using MultiGet, which sped up the process a lot.
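As a rough sketch of the batching side of that approach, the rowkeys could be split into fixed-size chunks, with each chunk then turned into a list of `Get` objects and sent in one call (for example via `Table.get(List<Get>)`). The helper name `RowKeyBatches.partition` and the batch size are hypothetical; the HBase call itself is left out so the snippet stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits rowkeys into batches for multi-get style requests.
final class RowKeyBatches {
    // Returns consecutive chunks of at most `batchSize` items, preserving order.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            // Copy the sub-list so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }
}
```

Each returned batch would map to one round trip instead of one per rowkey, which is presumably where the speed-up comes from.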
I then tried using normal GET requests, each spawning a new thread, all running concurrently, which halved the query time! But it is still not fast enough.
I have considered limiting the number of threads using a single connection to the database, i.e. 100 threads per connection.
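A minimal sketch of what capping the thread count might look like, using a fixed-size `ExecutorService` rather than one raw thread per rowkey. The class `BoundedLookups` and the `lookup` function are hypothetical: `lookup` stands in for the real HBase GET so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs one lookup per rowkey on at most `maxThreads` threads.
final class BoundedLookups {
    static List<String> fetchAll(List<String> rowKeys,
                                 Function<String, String> lookup,
                                 int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String rowKey : rowKeys) {
                // `lookup` is a stand-in for the real GET against the table.
                Callable<String> task = () -> lookup.apply(rowKey);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // waits; keeps results in rowkey order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lookups", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike starting an unbounded number of threads, this also waits for every lookup to finish before returning, so the caller never reads a half-filled result list.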
Given my circumstances, what is the most efficient way to perform these GETs, or am I approaching it totally incorrectly?
Any help is hugely appreciated.
EDIT (Here is my threaded GET attempt):
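    // Thread-safe list shared by all GET threads.
    List<String> newresults = Collections.synchronizedList(new ArrayList<String>());

    for (String rowkey : result) {
        spawnGetThread(rowkey, colname);
    }

    // Starts one thread that fetches a single column value for the given rowkey.
    public void spawnGetThread(final String rk, final String cn) {
        new Thread(new Runnable() {
            public void run() {
                String rt = "";
                Get get = new Get(Bytes.toBytes(rk));
                get.addColumn(COL_FAM, Bytes.toBytes(cn)); // qualifier must be a byte[]
                try {
                    Result getResult = tb.get(get);
                    rt = Bytes.toString(getResult.value());
                } catch (IOException e) {
                    e.printStackTrace(); // don't swallow failures silently
                }
                newresults.add(rt);
            }
        }).start();
    }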