I filter a Dataset by year to get a list of Datasets, which I then want to persist in parallel.

Code:

val yearWiseDsList = years.map(year => ds.filter($"year".rlike(year.toString)))

yearWiseDsList.zipWithIndex.foreach { case (xDf, idx) =>
  xDf.write.format("csv").option("header", "false").save("mydata_" + (startYear + idx))
}

Currently the foreach runs sequentially. I could convert yearWiseDsList to a parallel collection (.par, see the sketch below), but then it won't be using Spark for the parallelisation.
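
For concreteness, the .par variant would look roughly like this (a sketch; yearWiseDsList and startYear are as defined above):

// .par is built in for Scala <= 2.12; Scala 2.13 needs the
// scala-parallel-collections module.
yearWiseDsList.zipWithIndex.par.foreach { case (xDf, idx) =>
  // Each save() is still executed by Spark as a distributed job;
  // .par only overlaps the submission of those jobs from the driver.
  xDf.write.format("csv").option("header", "false").save("mydata_" + (startYear + idx))
}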

How can I do this with Spark?

  • Apache Spark has a totally different parallelization scheme. You can start by searching `apache spark rdd tutorial` on Google. – sarveshseri Feb 02 '17 at 15:24
  • Spark RDDs don't help me. You are suggesting I do sc.parallelize on the yearWiseDsList, which I have already tried and it doesn't work. – sanjay Feb 02 '17 at 18:13
  • `it doesn't work.`... what does not work? What is it that you could not achieve? RDDs are the way of parallelisation in Spark. What else do you want to do? – sarveshseri Feb 02 '17 at 23:55

1 Answer


The question is about nested parallelization in Spark. The following link answers it:

Nesting parallelizations in Spark? What's the right approach?
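
The gist of the linked answer, roughly, is that Spark jobs cannot be nested inside other jobs; instead, independent jobs can be submitted concurrently from driver threads, and Spark's scheduler runs them in parallel across the cluster. A minimal sketch of that pattern with Scala Futures (reusing yearWiseDsList and startYear from the question):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// One write job per year, submitted from separate driver threads.
// Spark can run these jobs concurrently when executors have free capacity.
val writes = yearWiseDsList.zipWithIndex.map { case (xDf, idx) =>
  Future {
    xDf.write.format("csv").option("header", "false").save("mydata_" + (startYear + idx))
  }
}

// Block until every write has finished.
Await.result(Future.sequence(writes), Duration.Inf)

If the exact output paths are flexible, a single job can also write all years at once with ds.write.partitionBy("year"), which avoids driver-side threading entirely.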
