
This is my code:

spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")

spark_df1.count()  # This command took around 1.40 min to execute

spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")

test_data = spark_df1.sample(fraction=0.001)

spark_df2 = spark_df1.subtract(test_data)

spark_df2.count()  # This command takes more than 20 min to execute. Can anyone
                   # explain why the same count command takes so long here?

Why does count() take so much longer after using the subtract command than before?

vladsiv
  • Given this is a performance-related question, follow [this guide](https://stackoverflow.com/a/48428198/7989581) to structure the question better. – Nithish Dec 16 '21 at 06:11

1 Answer


The gist is that subtract is an expensive operation involving a join and a distinct, which incur shuffles, so it takes much longer than a plain spark_df1.count(). How much longer depends on the Spark executor configuration and the partitioning scheme. Do update the question per the comment above for an in-depth analysis.

Nithish