
I'm trying to read a CSV file from the local file system, create a DataFrame from it, delete the file, and return the DataFrame. Yes, I have to delete it. Since everything except the deletion is done lazily, the app fails: by the time the read is actually executed, the file can no longer be found.

def do_something(): DataFrame = {
  val file = File.createTempFile("query2Output", ".csv")
  // some code which writes to the file

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load(file.getPath)

  file.delete()
  df
}
Hagai

1 Answer


You can cache your DataFrame and then call an action such as `count` on it to force the read immediately after the DataFrame is created:

val df = /* reading */.cache()
df.count()
file.delete()
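
For completeness, here is a minimal sketch of the question's `do_something` with this fix applied. It assumes, as in the question, that a `sqlContext` and the `com.databricks.spark.csv` package are available:

import java.io.File
import org.apache.spark.sql.DataFrame

def do_something(): DataFrame = {
  val file = File.createTempFile("query2Output", ".csv")
  // some code which writes to the file

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load(file.getPath)
    .cache()               // mark the DataFrame for caching

  df.count()               // action: forces the file to be read (and cached) now
  file.delete()            // safe: the rows already live in Spark's cache
  df
}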

However, if processing fails later, you will not be able to recreate the DataFrame, because the source file has already been deleted.

T. Gawęda
  • Thanks! I tried .cache() but didn't realize I have to call .count() (or any other action) for it to take effect. – Hagai Mar 13 '17 at 16:47
  • It's lazy :) Check also http://stackoverflow.com/questions/42714291/how-to-force-dataframe-evaluation-in-spark – T. Gawęda Mar 13 '17 at 16:51
  • By the way, why won't df.count() suffice? – Hagai Mar 14 '17 at 11:48
  • @Raytracer Because `count` by itself only computes the count; the next action would read the file from disk again. If you `cache` before `count`, the dataset is materialized in memory during the `count` and won't be re-read from disk. – T. Gawęda Mar 14 '17 at 12:03
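
To illustrate the point from the last comment, a sketch of what happens without `cache()`, under the same assumptions as above (Spark typically surfaces the missing file as a FileNotFoundException when the action runs):

// assuming `file` holds the CSV written earlier, as in the question
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(file.getPath)      // lazy: nothing is read yet

df.count()                 // first action: reads the file from disk
file.delete()
df.count()                 // re-reads the file from disk and fails, since the
                           // file is gone; with .cache() before the first
                           // count this would be served from memory instead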