I am using HBase to store webtable content, much like Google does with Bigtable.
For reference, see Google's Bigtable paper.
My question is about the RowKey and how it should be formed.
Google stores the URL with the hostname reversed, as you can see in the PDF document ("com.cnn.www"), so that all the links associated with cnn.com are managed in the same block of GFS, which makes them a lot easier to scan.
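To make the reversed form concrete, here is a minimal sketch of how such a key could be built in Python 3. The function name `reversed_row_key` is my own placeholder, and the code assumes a plain host-plus-path URL (ports, query strings and fragments are ignored); it is not taken from the Bigtable paper.

```python
from urllib.parse import urlsplit

def reversed_row_key(url):
    """Build a Bigtable-style row key: reversed host labels followed by the path."""
    # urlsplit only recognises the host when a scheme is present,
    # so add one for bare inputs such as "www.cnn.com/index.php".
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "www.cnn.com" -> ["www", "cnn", "com"] -> "com.cnn.www"
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path

print(reversed_row_key("www.cnn.com/index.php"))  # com.cnn.www/index.php
```

Because the key starts with the reversed domain, every page of one site shares a common prefix and sorts next to the others.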
I could use the same scheme as Google, but wouldn't it be better to run the URL through some algorithm that compresses it?
For example:
URL | Google Bigtable row key | Algorithm output
www.cnn.com/index.php | com.cnn.www/index.php | 12as/435
www.cnn.com/news/business/index.html | com.cnn.www/news/business/index.html | 12as/2as/dcx/asd
www.cnn.com/news/sports/index.html | com.cnn.www/news/sports/index.html | 12as/2as/eds/scf
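As a rough illustration of the kind of shortening I have in mind, here is one possible sketch, not a settled design. It replaces each "/"-separated segment of the reversed key with a short truncated SHA-1 token. The function names and the 4-character token length are my own assumptions. Truncated hashes can collide and cannot be turned back into the original URL, so this only shows the idea.

```python
import hashlib

def shorten_segment(segment, length=4):
    """Map one key segment to a short, deterministic token (hypothetical scheme)."""
    digest = hashlib.sha1(segment.encode("utf-8")).hexdigest()
    # Keep only the first few hex characters; shorter tokens collide more often.
    return digest[:length]

def compressed_row_key(reversed_key):
    """Shorten every '/'-separated segment of an already reversed key."""
    return "/".join(shorten_segment(s) for s in reversed_key.split("/"))

a = compressed_row_key("com.cnn.www/news/business/index.html")
b = compressed_row_key("com.cnn.www/news/sports/index.html")
# Both keys start with the same tokens for "com.cnn.www" and "news",
# so rows from the same site and section still share a prefix.
print(a)
print(b)
```

Identical segments always map to the same token, so the common prefix is preserved, but the tokens themselves no longer sort in the same order as the original segment names.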
The reason for doing this is that the row key would be shorter, as the HBase schema design guidance recommends (see section 6.3.2.3, Rowkey Length).
So what I need to know from you is whether I am on the right track here.
If I am, which algorithm should I use? I am using Python over Thrift, so example code would be very welcome.