
I have a pilot HBase cluster with 1 master and 5 slave nodes. I want to access the cluster via its REST API (basically writing ad impression data via GETs). I want to be able to run aggregated reports later using Hadoop/Hive/Pig (TBD), so I want a single picture of the data.

Do I start the REST server on the master and just write to that single endpoint, or do I start a REST server instance on each slave node and load balance writes across the slave nodes?

(The latter doesn't seem right, but I saw some mention of it in the docs, so I'm a little confused.)

2 Answers

I use the REST API with load balancing provided through nginx. Your nginx config would look something like this:

upstream cluster
{
    server master:1234;
    server slave1:1234;
    server slave2:1234;
    server slave3:1234;
    server slave4:1234;
}
server
{
    listen 4444;
    server_name someserver.com;
    location /
    {
        proxy_pass http://cluster;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
    }
}

You would run the following on all servers in the cluster:

hbase rest start -p 1234
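
If you'd rather not keep a foreground process on each box, the REST server can also be run in the background with the hbase-daemon.sh script that ships with HBase (port assumed to match the config above):

# Start the REST server as a daemon on the same port nginx points at
hbase-daemon.sh start rest -p 1234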

You would call someserver.com:4444 for your REST calls.
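
As a rough illustration (the table name impressions and column family d are made up for this example), a single write through the load balancer could look like the following; note that the HBase REST interface expects row keys, column names, and cell values to be base64-encoded:

# PUT one cell through the nginx endpoint.
# Base64 values used here: "row1" = cm93MQ==, "d:clicks" = ZDpjbGlja3M=, "1" = MQ==
curl -X PUT "http://someserver.com:4444/impressions/row1" \
  -H "Content-Type: application/json" \
  -d '{"Row":[{"key":"cm93MQ==","Cell":[{"column":"ZDpjbGlja3M=","$":"MQ=="}]}]}'

# Read the row back as JSON (cells come back base64-encoded too)
curl -H "Accept: application/json" "http://someserver.com:4444/impressions/row1"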

If you don't want the REST server to be a bottleneck, then you want to run several of them and load balance across them.

I'm not sure whether I'd run them on the datanodes themselves or on a separate group of boxes. Parsing REST messages at a high rate might impact the performance of HBase itself.
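
For example, if you did move the REST servers onto a separate tier of gateway boxes (hostnames below are hypothetical), only the upstream block of the nginx config in the other answer would need to change:

upstream cluster
{
    # Dedicated REST gateway boxes, kept off the datanodes so request
    # parsing doesn't compete with HBase/HDFS for CPU and memory
    server restgw1:1234;
    server restgw2:1234;
    server restgw3:1234;
}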
