I am running into some issues using cache on a spark dataframe. My expectation is that after a cache on a dataframe, the dataframe is created and cached the fist time it is needed. Any further calls to the dataframe should be from the cache
here's my code:
val mydf = spark.sql("read about 400 columns from a hive table").
withColumn ("newcol", someudf("existingcol")).
cache()
To test I ran a mydf.count() twice. I would expect the first time to take some time since the data is being cached. But the second time should be instantaneous?
What I am actually seeing is that it takes the same time for both the counts. This first one comes back pretty quickly which I think tells me that the data was not cached. If I remove the withColumn part of the code and just cache the raw data, the second count is instantaneous
Am I doing something wrong? How can I load raw data from hive, add columns and then cache the dataframe for further use? Using spark 2.3
Any help will be great!