I have a question about the difference in execution time when filtering pandas and PySpark dataframes:
import time
import numpy as np
import pandas as pd
from random import shuffle
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))  # 200,000 rows, 2 integer columns
list_filter = list(range(1000))
shuffle(list_filter)
# pandas is fast
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0072
df_spark = spark.createDataFrame(df)
# pyspark is slow
t0 = time.time()
df_spark_filtered = df_spark[df_spark[0].isin(list_filter)]
print(time.time() - t0)
# 3.1232
If I increase the length of list_filter to 10000, the execution times become 0.01353 and 17.6768 seconds. The pandas implementation of isin seems to be computationally efficient. Can you explain why filtering a PySpark dataframe is so slow, and how I can perform such filtering quickly?
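If I understand correctly, Spark evaluates transformations lazily, so the snippet above may largely be timing the construction of the isin expression rather than the filtering itself. One alternative I am considering is to express the filter as a left-semi join against a small dataframe of keys instead of a literal list; a minimal sketch, assuming the same spark session and df_spark as above (the column name "0" comes from the pandas dataframe, and broadcast is only an optimizer hint):

from pyspark.sql.functions import broadcast

# put the filter values into a single-column dataframe named "0" to match df_spark
keys = spark.createDataFrame([(k,) for k in list_filter]).toDF("0")
# a left-semi join keeps the rows of df_spark whose first column appears in keys;
# broadcast() hints that the small keys table can be shipped to every executor
df_spark_filtered2 = df_spark.join(broadcast(keys), on="0", how="leftsemi")
# an action such as count() is what actually triggers the computation
print(df_spark_filtered2.count())

My understanding is that this replaces the long IN expression with a single join, but I am not sure it is the idiomatic approach.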