
I'm still fairly new to Spark, and I have a question.

Let's say I need to submit a Spark application to a 4-node cluster, where each node has a standalone storage backend (e.g. RocksDB) containing exactly the same key/value rows, and I need to read the data from that store to process it. I can create an RDD by fetching all the rows I need from the storage and calling parallelize on the dataset:

public <K, V> JavaRDD<V> parallelize(Map<K, V> data) {
    // sparkContext is a JavaSparkContext; at this point the whole Map is already in driver memory
    return sparkContext.parallelize(new ArrayList<>(data.values()));
}
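
For context, this is roughly how the Map gets built on the driver before parallelize is called (simplified sketch; dbPath and the exact key/value types are just placeholders for illustration):

RocksDB db = RocksDB.openReadOnly(dbPath);   // org.rocksdb.RocksDB
Map<String, byte[]> data = new HashMap<>();
RocksIterator it = db.newIterator();
// Every row is pulled from disk into driver memory here, before the RDD even exists
for (it.seekToFirst(); it.isValid(); it.next()) {
    data.put(new String(it.key()), it.value());
}
it.close();
db.close();

JavaRDD<byte[]> rdd = parallelize(data);     // the helper above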

However, this way I still need to pull every row that I need to process from disk into memory, on every node in the cluster, even though each node is only going to process a part of it, because the data has to sit in the Map structure before the RDD is created.
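
What I was hoping for is something where only the keys get shipped to the executors, and each node reads just its own share of values from its local copy of the store. I don't know if this is actually possible or sensible, but a rough sketch of the idea would be something like this (assuming each executor can open its node-local RocksDB read-only at dbPath, and Spark 2.x where mapPartitions returns an Iterator; names are made up):

public JavaRDD<byte[]> parallelizeKeysOnly(JavaSparkContext sc, List<String> keys, String dbPath) {
    // Only the keys travel from the driver to the executors
    return sc.parallelize(keys)
             .mapPartitions(keyIter -> {
                 // Runs on the executor: open the node-local copy read-only
                 // and fetch only the values for this partition's keys
                 RocksDB db = RocksDB.openReadOnly(dbPath);
                 List<byte[]> values = new ArrayList<>();
                 while (keyIter.hasNext()) {
                     values.add(db.get(keyIter.next().getBytes()));
                 }
                 db.close();
                 return values.iterator();
             });
}

But this still assumes I can enumerate the keys cheaply on the driver, so maybe I'm just moving the problem around.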

Is there another way to do this, or am I seeing this wrongly? The database is not supported by Hadoop, and I can't use HDFS for this use case. It isn't supported by JDBC either.

Thank you in advance.

PablodeAcero
    Possible duplicate of [Integrate key-value database with Spark](http://stackoverflow.com/questions/41064850/integrate-key-value-database-with-spark) – Kartoch Mar 26 '17 at 08:01

0 Answers