Let's say I want to use Hadoop for its great ability to scale applications across a cluster and to work with a lot of data. Suppose I have a big bunch of time series stored in HBase (I can elaborate on this if there are better ideas), maybe with one column per frame (this too can change if a better idea comes up). Now the algorithm has to run and scale over these time series (a set of them, actually), but the problem is that in order to work, the algorithm needs one time series plus a variable bunch of other time series. This defeats the "data locality" feature of Hadoop. Is this acceptable? Is there a better way? Maybe a custom application instead of MapReduce?
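For concreteness, here is a minimal sketch of the storage layout I have in mind, written against the plain HBase client API; the table name `timeseries`, the column family `f`, and the frame-index-as-qualifier encoding are just placeholders, not a settled design:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object WriteSeriesSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    // "timeseries" table and "f" column family are placeholder names
    val table = connection.getTable(TableName.valueOf("timeseries"))

    val seriesId = "series-0001"                  // one row per time series
    val frames: Array[Double] = Array(1.0, 2.5, 3.7)

    val put = new Put(Bytes.toBytes(seriesId))
    frames.zipWithIndex.foreach { case (value, frameIdx) =>
      // one column per frame, with the frame index as the qualifier
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes(frameIdx), Bytes.toBytes(value))
    }
    table.put(put)

    table.close()
    connection.close()
  }
}
```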

- What about Apache Spark? No Hadoop necessary, if you don't need it – OneCricketeer Nov 14 '16 at 20:08
- Spark can use YARN as a cluster manager; I don't see how that solves the data locality problem – Felice Pollano Nov 14 '16 at 20:09
- You have data in HBase, not HDFS, so I don't see where the locality factor comes into play. And yes, Spark *can* use YARN, but it is not necessary – OneCricketeer Nov 14 '16 at 20:11
- @cricket_007 Doesn't HBase store data on the cluster as well? So a table can be local to some node, and so on? – Felice Pollano Nov 14 '16 at 20:13
- I don't think so. HBase sits separate from HDFS, AFAIK. [See differences here](http://stackoverflow.com/questions/16929832/difference-between-hbase-and-hadoop-hdfs) – OneCricketeer Nov 14 '16 at 20:16
- Of course, I may be completely wrong about that... Anyway, there are Spark connectors for HBase, which was my main point. MapReduce over HBase isn't really a good strategy – OneCricketeer Nov 14 '16 at 20:19
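
Following up on the last comment, here is a rough sketch of what reading the HBase table from Spark could look like, using Spark's standard `newAPIHadoopRDD` with HBase's `TableInputFormat` rather than a dedicated connector. The table name, the `loadAuxiliarySeries` and `runAlgorithm` stubs, and the idea of broadcasting the "other" time series to every executor are assumptions to illustrate the approach, not a definitive implementation:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object SparkOverHBaseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("timeseries-algo"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "timeseries") // placeholder table name

    // Read every row (one time series per row) as an RDD of (rowKey, Result)
    val seriesRdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Hypothetical: the "variable bunch of other time series" the algorithm needs,
    // loaded once on the driver and broadcast so every executor has a local copy.
    val auxiliarySeries: Map[String, Array[Double]] = loadAuxiliarySeries()
    val aux = sc.broadcast(auxiliarySeries)

    val results = seriesRdd.map { case (rowKey, result) =>
      val seriesId = Bytes.toString(rowKey.get())
      // Decode the frame columns of this series, then run the algorithm
      // against it plus the broadcast auxiliary series.
      runAlgorithm(seriesId, result, aux.value)   // hypothetical function
    }

    results.count() // force evaluation for the sketch
    sc.stop()
  }

  // Stubs standing in for the real loading and algorithm code
  def loadAuxiliarySeries(): Map[String, Array[Double]] = Map.empty
  def runAlgorithm(id: String, row: Result, aux: Map[String, Array[Double]]): Int = 0
}
```

Would something along these lines (every executor reading its local HBase regions, with the extra series shipped to it via broadcast) be an acceptable way around the data locality concern, or is there a better pattern?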