
I am working on a project where I need to read 12 files, each about 3 GB on average. I read them as an RDD and build a DataFrame with spark.createDataFrame. I then need to run about 30 SQL queries against the DataFrame, most of which depend on the output of the previous one, so I save every intermediate result as a DataFrame and create a temp view for it.
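For reference, here is a minimal sketch of that setup; the paths, schema, and query strings below are placeholders, not the real job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("chained-queries").getOrCreate()

# Hypothetical schema: assumes each line is a two-column CSV record.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
])

# Read the raw files as an RDD and build a DataFrame from it.
rdd = (spark.sparkContext
            .textFile("/data/input/*.txt")
            .map(lambda line: tuple(line.split(",", 1))))
df = spark.createDataFrame(rdd, schema)
df.createOrReplaceTempView("step_0")

# Each of the ~30 queries builds on the temp view from the previous step.
step_1 = spark.sql("SELECT id, value FROM step_0 WHERE value IS NOT NULL")
step_1.createOrReplaceTempView("step_1")

step_2 = spark.sql("SELECT id, COUNT(*) AS cnt FROM step_1 GROUP BY id")
step_2.createOrReplaceTempView("step_2")
```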

The processing part appears to take only 2 minutes, but writing the results to a CSV file, calling show(), or calling count() takes far too long. I have tried repartitioning, but it is still too slow.

1. What could be the solution?
2. Why does writing take so much time when all the processing seems to take so little?

  • You have to show _some_ code so we can at least see what's going on. How do you know it only takes 2 minutes to execute? Are you aware that Spark is lazy? Please read [ask] and try to provide a [mcve]. See more on [how to provide good reproducible spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Feb 26 '18 at 14:37

1 Answer


I solved the above problem with persist and cache in PySpark.

Spark is lazily evaluated. Spark RDD operations come in two types: transformations and actions. A transformation is a function that produces a new RDD from existing RDDs, but nothing actually runs until we want to work with the actual dataset, at which point an action is performed. When an action is triggered, it executes the accumulated transformations and returns a result rather than a new RDD.
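A minimal, self-contained illustration of the difference (the DataFrame and columns here are made up for the demo):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # simple DataFrame just to demonstrate

# Transformations are lazy: these lines only build up a query plan;
# no data is read or computed yet.
doubled = df.withColumn("double", df.id * 2)    # transformation
filtered = doubled.filter(doubled.double > 10)  # transformation

# Actions trigger execution of the whole plan built so far.
print(filtered.count())  # action: runs the job
filtered.show(5)         # action: runs the whole plan again
```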

Every operation I performed was just a transformation, so each time I used a particular DataFrame, Spark re-ran its entire chain of parent queries because of lazy evaluation. Adding persist stopped the parent queries from being re-executed every time, which saved a lot of processing time.
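Here is a sketch of what that looks like; the intermediate DataFrame and view name are hypothetical stand-ins for the real chain of queries:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# A stand-in for an expensive intermediate result (hypothetical).
base = spark.range(10_000_000)
intermediate = base.filter(base.id % 7 == 0)

# Persist the intermediate DataFrame so later actions reuse it
# instead of recomputing the entire parent lineage each time.
intermediate.persist(StorageLevel.MEMORY_AND_DISK)  # or intermediate.cache()
intermediate.createOrReplaceTempView("intermediate")

intermediate.count()  # first action materializes and caches the result
intermediate.show(5)  # served from the cache; parents are not re-run

# Release the storage once downstream queries no longer need it.
intermediate.unpersist()
```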
