This is my code:
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
spark_df1.count()  # This command took around 1.40 min to execute
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
test_data = spark_df1.sample(fraction=0.001)
spark_df2 = spark_df1.subtract(test_data)
spark_df2.count()  # This command takes more than 20 min to execute.
# Can anyone explain why the same count() call takes so much longer here?
Why does count() take so much longer after using subtract() than before?