I have a script snippet that I am running in different cluster setups on PySpark 2.4:
import os

v1 = spark.read.parquet(os.path.join(v1_prefix, 'df1.parquet'))
v2 = spark.read.parquet(os.path.join(v2_prefix, 'df2.parquet'))

# join the two versions on their composite key
out = v1.join(v2, [v1.Id == v2.Id, v1.Year == v2.Year, v1.Month == v2.Month])

# compare each old/new column pair and show rows where the values differ
for x in v1.columns:
    tmp = out.select(v1[x].alias(x + '_old'), v2[x].alias(x + '_new')) \
             .filter('{}_old != {}_new'.format(x, x))
    if tmp.count() > 0:
        tmp.show()
Both DataFrames have 200+ columns and roughly 1.5 million records, so the joined out DataFrame has 400+ columns that are compared pairwise to determine whether there are any differences. The timings:
- single-node cluster: 4-8 minutes
- 2-node cluster: ~50 minutes
I assume that on the 2-node cluster the data is partitioned across different executors and gets shuffled, which slows down performance.
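If that is the case, I assume persisting the joined DataFrame (so the 200+ per-column jobs reuse it instead of redoing the join each time) or repartitioning both inputs on the join keys up front might help; a rough sketch of what I mean, not something I have verified:

# cache the join result so each per-column job reuses it instead of
# re-reading and re-shuffling both parquet inputs
out = out.persist()
out.count()  # materialize the cache once

# or: repartition both sides on the join keys so each input is shuffled only once
v1 = v1.repartition('Id', 'Year', 'Month')
v2 = v2.repartition('Id', 'Year', 'Month')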
How can I improve the out DataFrame so that the data is evenly distributed and the job runs at least as fast as it does on a single node with Spark 2.4?
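Independent of the distribution question, I am also wondering whether the loop itself contributes, since every iteration launches a separate count() job over the join. A single-pass variant that counts mismatches for all columns in one aggregation would look roughly like this (a sketch, assuming Id, Year and Month are the only shared key columns):

from pyspark.sql import functions as F

# count mismatches for every column in one job instead of 200+ separate ones
keys = ['Id', 'Year', 'Month']
compare_cols = [c for c in v1.columns if c not in keys]

row = out.agg(*[
    F.sum(F.when(v1[c] != v2[c], 1).otherwise(0)).alias(c)
    for c in compare_cols
]).collect()[0]

changed = {c: n for c, n in row.asDict().items() if n}
print(changed)

The null semantics should match the original filter: a comparison where either side is null evaluates to null, so it falls through to otherwise(0) and is not counted as a difference.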