I am creating a DataFrame with pandas.read_csv() from an 8 MB file.

df_ratings = pd.read_csv(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv')
list_users = df_ratings['userId'].unique().tolist()
print(list_users)

This takes about 0.34 seconds.
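For reference, this figure can be reproduced with a sketch like the following (the time.perf_counter calls are my addition, not part of the original measurement; the path is the same as above):

import time
import pandas as pd

path = r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv'

t0 = time.perf_counter()
df_ratings = pd.read_csv(path)                       # single-process, in-memory read
list_users = df_ratings['userId'].unique().tolist()  # distinct user ids
print('pandas: %.2f s' % (time.perf_counter() - t0))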

Using PySpark:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local[*]')
spark = SparkSession(sc)

spark_df = spark.read.format('csv').options(header='true', inferSchema='true').load(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv').cache()
spark_df.createOrReplaceTempView("user_table")

query = "SELECT DISTINCT userId FROM user_table"
list_users_data = spark.sql(query).collect()
list_users = [i.userId for i in list_users_data]

print(list_users)

This takes around 16 seconds.

The PySpark code should take less time than the pandas code.
Am I missing any configuration?

Note: I am running this code on a Windows machine with 8 GB RAM and a 4-core CPU.

GileBrt

1 Answer


The PySpark code should take less time than the pandas code.

No, it shouldn't. With a small dataset, most of this time is execution overhead: starting the driver, starting the workers, and building and executing the DAG. Spark is meant for processing big datasets that don't fit in the memory of a single machine, so that several workers are needed to process them. If your data is small enough for one machine to handle, stick to pandas; you don't need Spark.
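You can see this overhead yourself with a sketch along these lines (reusing the path from your question; the separate timers are my addition). It times session start-up separately from the read/distinct/collect step:

import time
from pyspark.sql import SparkSession

path = r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv'

t0 = time.perf_counter()
# Starting the JVM, the driver and the local executors happens here
spark = SparkSession.builder.master('local[*]').getOrCreate()
t1 = time.perf_counter()

# inferSchema forces an extra pass over the file before the actual query
spark_df = spark.read.csv(path, header=True, inferSchema=True)
list_users = [row.userId for row in spark_df.select('userId').distinct().collect()]
t2 = time.perf_counter()

print('session start-up: %.2f s' % (t1 - t0))
print('read + distinct + collect: %.2f s' % (t2 - t1))

On a file this small, the first number typically dwarfs the second, which is why pandas wins here.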

Rayan Ral