I filter a Dataset by year to get a list of Datasets, which I then want to persist in parallel.

Code:

val yearWiseDsList = years.map(year => ds.filter($"year".rlike(year.toString)))

yearWiseDsList.zipWithIndex.foreach { case (xDf, idx) =>
  xDf.write.format("csv").option("header", "false").save("mydata_" + (startYear + idx))
}

Currently the foreach runs sequentially. I could convert yearWiseDsList to a parallel collection (.par, see the sketch below), but then it won't be using Spark for the parallelisation.
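
For concreteness, the .par variant would look roughly like this (a sketch; yearWiseDsList and startYear are as defined above):

// .par is built in for Scala <= 2.12; Scala 2.13 needs the
// scala-parallel-collections module.
yearWiseDsList.zipWithIndex.par.foreach { case (xDf, idx) =>
  // Each save() is still executed by Spark as a distributed job;
  // .par only overlaps the submission of those jobs from the driver.
  xDf.write.format("csv").option("header", "false").save("mydata_" + (startYear + idx))
}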

How can I do this with Spark?

  • Apache Spark has a totally different parallelization scheme. You can start by searching `apache spark rdd tutorial` on Google. – sarveshseri Feb 02 '17 at 15:24
  • Spark RDDs don't help me. You are suggesting I do sc.parallelize on the yearWiseDsList, which I have already tried and it doesn't work. – sanjay Feb 02 '17 at 18:13
  • `it doesn't work.`... what does not work? What is it that you could not achieve? RDDs are the way of parallelisation in Spark. What else do you want to do? – sarveshseri Feb 02 '17 at 23:55

1 Answer


The question is about nested parallelization in Spark. The following link answers it:

Nesting parallelizations in Spark? What's the right approach?
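
The gist of the linked answer, roughly, is that Spark jobs cannot be nested inside other jobs; instead, independent jobs can be submitted concurrently from driver threads, and Spark's scheduler runs them in parallel across the cluster. A minimal sketch of that pattern with Scala Futures (reusing yearWiseDsList and startYear from the question):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// One write job per year, submitted from separate driver threads.
// Spark can run these jobs concurrently when executors have free capacity.
val writes = yearWiseDsList.zipWithIndex.map { case (xDf, idx) =>
  Future {
    xDf.write.format("csv").option("header", "false").save("mydata_" + (startYear + idx))
  }
}

// Block until every write has finished.
Await.result(Future.sequence(writes), Duration.Inf)

If the exact output paths are flexible, a single job can also write all years at once with ds.write.partitionBy("year"), which avoids driver-side threading entirely.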
