I have a process that reads a Hive table (Parquet + Snappy) and builds a Dataset of about 2 GB. The process is iterative (~7K iterations), and the dataset is the same for every iteration, so I decided to cache it.
Somehow the caching task runs on a single executor only, and the cached data seems to live on that one executor, which leads to delays, OOM errors, etc.

Is this because of Parquet? How can I make sure the cache is distributed across multiple executors?
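For context, here is roughly what the job does (simplified; the session setup, database, and table names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("iterative-job")
  .enableHiveSupport()
  .getOrCreate()

// Read the Parquet-backed Hive table (~2 GB once materialized).
val ds = spark.table("my_db.my_table")

// Cache it so the ~7K iterations reuse the same data.
ds.persist(StorageLevel.MEMORY_AND_DISK)
ds.count()  // action to force materialization before the loop
```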
Here is the Spark config:
- Executors: 3
- Cores per executor: 4
- Memory per executor: 4 GB
- Partitions: 200
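For reference, this is roughly how those settings are passed on submit (assuming `spark-submit`; I'm taking "Partitions: 200" to mean `spark.sql.shuffle.partitions`, and the app jar name is a placeholder):

```
spark-submit \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 4G \
  --conf spark.sql.shuffle.partitions=200 \
  my-app.jar
```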
I tried `repartition()` and adjusting the config, but no luck; a sketch of the repartition attempt follows.
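Specifically, something like this (continuing from the sketch above, same session and imports; the partition count here is just one value I tried):

```scala
// Redistribute before caching so the cached blocks land on all executors.
val cached = spark.table("my_db.my_table")
  .repartition(12)                          // 3 executors * 4 cores = 12
  .persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                              // materialize the cache
```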