I am creating a DataFrame with pandas.read_csv() from an 8 MB file.

df_ratings = pd.read_csv(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv')
list_users = df_ratings['userId'].unique().tolist()
print(list_users)

This takes about 0.34 seconds.
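For reference, this figure can be reproduced with a sketch like the following (the time.perf_counter calls are my addition, not part of the original measurement; the path is the same as above):

import time
import pandas as pd

path = r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv'

t0 = time.perf_counter()
df_ratings = pd.read_csv(path)                       # single-process, in-memory read
list_users = df_ratings['userId'].unique().tolist()  # distinct user ids
print('pandas: %.2f s' % (time.perf_counter() - t0))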

Using PySpark:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local[*]')
spark = SparkSession(sc)

spark_df = spark.read.format('csv').options(header='true', inferSchema='true').load(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv').cache()
spark_df.createOrReplaceTempView("user_table")

query = "SELECT DISTINCT userId FROM user_table"
list_users_data = spark.sql(query).collect()
list_users = [i.userId for i in list_users_data]

print(list_users)

This takes around 16 seconds.

The PySpark code should take less time than the pandas code.
Am I missing any configuration?

Note: I am running this code on a Windows machine with 8 GB RAM and a 4-core CPU.

GileBrt

1 Answer


The PySpark code should take less time than the pandas code.

No, it shouldn't. With a small dataset, most of this time is execution overhead: starting the driver, starting the workers, and building and executing the DAG. Spark is meant for processing big datasets that don't fit in the memory of a single machine, so that several workers are needed to process them. If your data is small enough for one machine to handle, stick to pandas; you don't need Spark.
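You can see this overhead yourself with a sketch along these lines (reusing the path from your question; the separate timers are my addition). It times session start-up separately from the read/distinct/collect step:

import time
from pyspark.sql import SparkSession

path = r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv'

t0 = time.perf_counter()
# Starting the JVM, the driver and the local executors happens here
spark = SparkSession.builder.master('local[*]').getOrCreate()
t1 = time.perf_counter()

# inferSchema forces an extra pass over the file before the actual query
spark_df = spark.read.csv(path, header=True, inferSchema=True)
list_users = [row.userId for row in spark_df.select('userId').distinct().collect()]
t2 = time.perf_counter()

print('session start-up: %.2f s' % (t1 - t0))
print('read + distinct + collect: %.2f s' % (t2 - t1))

On a file this small, the first number typically dwarfs the second, which is why pandas wins here.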

Rayan Ral