
This is my code:

spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")

spark_df1.count()  # This command took around 1.40 min to execute

spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")

test_data = spark_df1.sample(fraction=0.001)

spark_df2 = spark_df1.subtract(test_data)

spark_df2.count()  # This command takes more than 20 min to execute. Can anyone
                   # explain why the same count command takes so long here?

Why does count() take so much longer after using the subtract command than before?

vladsiv
  • Given this is a performance-related question, follow [this guide](https://stackoverflow.com/a/48428198/7989581) to structure the question better. – Nithish Dec 16 '21 at 06:11

1 Answer


The gist is that subtract is an expensive operation involving a join and a distinct, which incur shuffles, so it takes much longer than a plain spark_df1.count(). How much longer depends on the Spark executor configuration and the partitioning scheme. Do update the question per the comment above for an in-depth analysis.

Nithish