Let's say I want to use Hadoop for its great ability to scale applications across a cluster and to work with a lot of data. Suppose I have a big bunch of time series stored in HBase (I can elaborate on this if there are better ideas), maybe with one column per frame (this too can change if a better idea comes up). Now the algorithm has to run and scale over these time series (a set of them, actually), but the problem is that in order to work, the algorithm needs one time series plus a variable bunch of other time series. This defeats the "data locality" feature of Hadoop. Is this acceptable? Is there a better way? Maybe a custom application instead of MapReduce?
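For concreteness, here is a minimal sketch of the storage layout I have in mind, written against the plain HBase client API; the table name `timeseries`, the column family `f`, and the frame-index-as-qualifier encoding are just placeholders, not a settled design:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object WriteSeriesSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    // "timeseries" table and "f" column family are placeholder names
    val table = connection.getTable(TableName.valueOf("timeseries"))

    val seriesId = "series-0001"                  // one row per time series
    val frames: Array[Double] = Array(1.0, 2.5, 3.7)

    val put = new Put(Bytes.toBytes(seriesId))
    frames.zipWithIndex.foreach { case (value, frameIdx) =>
      // one column per frame, with the frame index as the qualifier
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes(frameIdx), Bytes.toBytes(value))
    }
    table.put(put)

    table.close()
    connection.close()
  }
}
```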

- What about Apache Spark? No Hadoop necessary, if you don't need it – OneCricketeer Nov 14 '16 at 20:08
- Spark can use YARN as a cluster manager; I don't see how that solves the data locality problem – Felice Pollano Nov 14 '16 at 20:09
- You have data in HBase, not HDFS, so I don't see where the locality factor comes into play. And yes, Spark *can* use YARN, but it is not necessary – OneCricketeer Nov 14 '16 at 20:11
- @cricket_007 Doesn't HBase store data on the cluster as well? So a table can be local to some node, and so on? – Felice Pollano Nov 14 '16 at 20:13
- I don't think so. HBase sits separate from HDFS, AFAIK. [See differences here](http://stackoverflow.com/questions/16929832/difference-between-hbase-and-hadoop-hdfs) – OneCricketeer Nov 14 '16 at 20:16
- Of course, I may be completely wrong about that... Anyway, there are Spark connectors for HBase, which was my main point. MapReduce over HBase isn't really a good strategy – OneCricketeer Nov 14 '16 at 20:19
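
Following up on the last comment, here is a rough sketch of what reading the HBase table from Spark could look like, using Spark's standard `newAPIHadoopRDD` with HBase's `TableInputFormat` rather than a dedicated connector. The table name, the `loadAuxiliarySeries` and `runAlgorithm` stubs, and the idea of broadcasting the "other" time series to every executor are assumptions to illustrate the approach, not a definitive implementation:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object SparkOverHBaseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("timeseries-algo"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "timeseries") // placeholder table name

    // Read every row (one time series per row) as an RDD of (rowKey, Result)
    val seriesRdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Hypothetical: the "variable bunch of other time series" the algorithm needs,
    // loaded once on the driver and broadcast so every executor has a local copy.
    val auxiliarySeries: Map[String, Array[Double]] = loadAuxiliarySeries()
    val aux = sc.broadcast(auxiliarySeries)

    val results = seriesRdd.map { case (rowKey, result) =>
      val seriesId = Bytes.toString(rowKey.get())
      // Decode the frame columns of this series, then run the algorithm
      // against it plus the broadcast auxiliary series.
      runAlgorithm(seriesId, result, aux.value)   // hypothetical function
    }

    results.count() // force evaluation for the sketch
    sc.stop()
  }

  // Stubs standing in for the real loading and algorithm code
  def loadAuxiliarySeries(): Map[String, Array[Double]] = Map.empty
  def runAlgorithm(id: String, row: Result, aux: Map[String, Array[Double]]): Int = 0
}
```

Would something along these lines (every executor reading its local HBase regions, with the extra series shipped to it via broadcast) be an acceptable way around the data locality concern, or is there a better pattern?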