I filter a dataset to get a list of per-year datasets, which I then want to persist in parallel.
Code:
val yearWiseDsList = years.map(year => ds.filter($"year".rlike(year.toString)))
yearWiseDsList.zipWithIndex.foreach { case (xDf, idx) =>
  xDf.write
    .format("csv")
    .option("header", "false")
    .save("mydata" + "_" + (startYear + idx))
}
Currently the foreach runs sequentially. I could convert yearWiseDsList to a parallel collection with .par, but then the parallelism would come from the driver's thread pool rather than from Spark itself. How can I do this with Spark?
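For reference, this is roughly what the .par variant I mean would look like (same names as in the snippet above). Each save is still executed by Spark, but the jobs are merely submitted concurrently from driver threads, which is the part I'd like Spark to manage instead. (On Scala 2.13+ this also needs the separate scala-parallel-collections module.)

```scala
// Sketch of the .par approach: zipWithIndex pairs each Dataset with its
// position, .par turns the list into a parallel collection, and the
// foreach bodies run concurrently on the driver's thread pool.
yearWiseDsList.zipWithIndex.par.foreach { case (xDf, idx) =>
  xDf.write
    .format("csv")
    .option("header", "false")
    .save("mydata" + "_" + (startYear + idx))
}
```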