I have a process that reads a Hive table (Parquet + Snappy) and builds a Dataset of about 2 GB. The process is iterative (~7K iterations), and the dataset is the same for every iteration, so I decided to cache it.
Somehow the caching task runs on a single executor only, and the cached data seems to live on that one executor, which leads to delays, OOM errors, etc.

Is this because of Parquet? How can I make sure the cache is distributed across multiple executors?
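For context, here is roughly what the job does (simplified; the session setup, database, and table names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("iterative-job")
  .enableHiveSupport()
  .getOrCreate()

// Read the Parquet-backed Hive table (~2 GB once materialized).
val ds = spark.table("my_db.my_table")

// Cache it so the ~7K iterations reuse the same data.
ds.persist(StorageLevel.MEMORY_AND_DISK)
ds.count()  // action to force materialization before the loop
```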
Here is the Spark config:
- Executors: 3
- Cores per executor: 4
- Memory per executor: 4 GB
- Partitions: 200
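For reference, this is roughly how those settings are passed on submit (assuming `spark-submit`; I'm taking "Partitions: 200" to mean `spark.sql.shuffle.partitions`, and the app jar name is a placeholder):

```
spark-submit \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 4G \
  --conf spark.sql.shuffle.partitions=200 \
  my-app.jar
```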
I tried `repartition()` and adjusting the config, but no luck; a sketch of the repartition attempt follows.
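Specifically, something like this (continuing from the sketch above, same session and imports; the partition count here is just one value I tried):

```scala
// Redistribute before caching so the cached blocks land on all executors.
val cached = spark.table("my_db.my_table")
  .repartition(12)                          // 3 executors * 4 cores = 12
  .persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                              // materialize the cache
```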