
I'm trying to compare the performance of Spark queries on a dataset based on Parquet files and on the same dataset cached in memory.

Surprisingly, queries on the Parquet dataset are faster than queries on the cached data. I see at least two reasons why this should not be the case:

  • cached data is in memory, while the Parquet file isn't (it's on my SSD)
  • I expect cached data to be stored in a format optimized for Spark queries

I've done this small benchmark on a 300MB Parquet file (9M rows), timing only the query time, not the time to cache the data:

def benchmarkSum(ds: org.apache.spark.sql.DataFrame): Double = {
  val begin = System.nanoTime()
  for (i <- 1 to 1000) {
    ds.groupBy().sum("columnName").first()
  }
  (System.nanoTime() - begin) / 1000000000.0
}

val pqt = spark.read.parquet("myfile.parquet")
benchmarkSum(pqt) // 54s

val cached = pqt.cache()
cached.groupBy().sum("columnName").first() // A first call to trigger the caching before the benchmark.
benchmarkSum(cached) // 77s

The queries on Parquet took 54s, while they took 77s on the cached dataset.
I am doing this benchmark in a spark-shell with 8 cores and 10GB memory.
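
To see what each variant actually reads, here is a small sketch comparing the physical plans of the two queries ("columnName" is the placeholder used above; the exact plan shapes depend on the Spark version):

// Sketch: the Parquet-backed plan should show a FileScan with ReadSchema
// limited to the single aggregated column, while the cached variant should
// go through an InMemoryTableScan over the cached relation.
pqt.groupBy().sum("columnName").explain()
cached.groupBy().sum("columnName").explain()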

So why is it slower to use cached data to sum my column? Am I doing something wrong?

  • Relying on `count` and `cache` in the case of DataFrames is a bad practice. – philantrovert Jul 19 '18 at 13:57
  • @philantrovert Are you saying that I should never cache a DataFrame, or never trigger the caching with count? I tried to remove the count or replace it with other calls and the results are similar. – Fabich Jul 19 '18 at 14:10
  • I've removed the call to .count in my question. This is not a duplicate; I don't see how the other question answers this. I am comparing the behaviour of a cached dataset and a dataset based on a Parquet file, while the other question is about when to trigger the dataset evaluation. – Fabich Jul 19 '18 at 14:44
  • `first` will trigger cache on __the first partition__. It won't cache the full dataset. Also, "cached data is in memory" - not necessarily. It might be on disk. "I'm expecting cached data to be in an optimized format for spark queries" - you're applying global aggregations. It is way cheaper to read only a single set of values and aggregate them as they come than to build a complex and expensive structure first. And even ignoring all of that, 54s vs. 77s in a single run is not a statistically meaningful result. Spark can show much higher variance depending on a number of factors. – Alper t. Turker Jul 20 '18 at 10:37
  • Which brings us to `System.nanoTime()` - it is a very crude way of measuring time in Spark. It doesn't tell you what exactly contributes to that time. If you're looking for actual insights, it is better to analyze data from the UI. But overall, the majority of the data will be `cached` when you apply `benchmarkSum` for the second time, and it is reasonable to expect that it will impact execution time. – Alper t. Turker Jul 20 '18 at 10:46
  • @eliasah I am leaning towards reopening this. The other question _partially_ addresses the problem, but I am not convinced that it is sufficient. Any objections? – Alper t. Turker Jul 20 '18 at 10:55
  • Lean on, we may learn something interesting – thebluephantom Jul 20 '18 at 13:14
  • But won't the sum actually cause caching - at least on the column of the parquet file? @user8371915 – thebluephantom Jul 20 '18 at 14:37
  • @thebluephantom `sum` will (should) cache the data, which is in fact a problem. It will cache all columns (not only the one required), and the cache alone is expensive (computing statistics, possibly sorting and encoding) even in memory, and can become even more expensive if data is put on disk (today Spark uses `MEMORY_AND_DISK` as the default). There is a reason why [cache is not the default behavior](https://stackoverflow.com/q/34117469/8371915). – Alper t. Turker Jul 20 '18 at 15:10
  • OK, thx - but we keep on reading that caching is a good thing. On the other hand, there are all these memory issues with Spark as well. – thebluephantom Jul 20 '18 at 15:35
  • @user8371915 I know that caching has a cost, that's why I trigger it with a first call. I also tried to run the benchmark multiple times with consistent results, so this is not a problem linked to the first call to cache. I have watched in the Spark UI and the dataset is 100% in memory. As System.nanoTime() is not the correct way to get the time, I will try with larger datasets and look at the Spark UI timings. – Fabich Jul 20 '18 at 15:59
  • It’s clear to me this is a grey area with not much knowledge. Under the hood optimization mmm. Just got out my High Performance Spark book. – thebluephantom Jul 20 '18 at 17:11
  • And ... what came out of all this? – thebluephantom Jul 23 '18 at 19:18
  • Just out of interest - did you resolve this? – thebluephantom Aug 01 '18 at 12:12
  • @thebluephantom No, I've tried to scale to larger datasets and this behavior still exists. My opinion is that the Parquet file is cached in memory by Linux and is more optimized than Spark's internal format for this kind of operation. – Fabich Aug 01 '18 at 12:47
  • I think you may well be right - columnar, and maybe even a per-block count of records. – thebluephantom Aug 01 '18 at 13:08
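
Following up on the points raised in the comments above about partial caching, the default storage level, and per-column caching, here is a minimal sketch of how they could be checked in the same spark-shell session (the variable names `fullyCached` and `onlyColumn` are illustrative, and "columnName" is the placeholder from the question):

// Sketch, assuming the same spark-shell session as in the question.
// count() touches every partition, so the whole dataset gets cached,
// unlike first(), which only materializes the first partition.
val fullyCached = spark.read.parquet("myfile.parquet").cache()
fullyCached.count()
println(fullyCached.storageLevel)   // MEMORY_AND_DISK is the default in Spark 2.x

// spark.time (available on SparkSession since Spark 2.1) prints the wall-clock
// time of a single action - a handy spot check alongside the Spark UI timings.
spark.time { fullyCached.groupBy().sum("columnName").first() }

// Caching only the projected column keeps the in-memory columnar store small
// and cheaper to build than caching all columns.
val onlyColumn = spark.read.parquet("myfile.parquet").select("columnName").cache()
onlyColumn.count()
onlyColumn.groupBy().sum("columnName").first()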

0 Answers