I am creating a dataframe using pandas.read_csv() from an 8 MB file:
df_ratings = pd.read_csv(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv')
list_users = df_ratings['userId'].unique().tolist()
print(list_users)
This takes 0.34 seconds.
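(For reference, a minimal sketch of how that number could be measured; the time.perf_counter wrapper below is illustrative, not necessarily the exact stopwatch I used.)

import time
import pandas as pd

start = time.perf_counter()
df_ratings = pd.read_csv(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv')
list_users = df_ratings['userId'].unique().tolist()
print(time.perf_counter() - start)  # ~0.34 on my machine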
Using PySpark:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local[*]')
spark = SparkSession(sc)
spark_df = spark.read.format('csv').options(header='true', inferSchema='true').load(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv').cache()
spark_df.createOrReplaceTempView("user_table")
query = "SELECT DISTINCT userId FROM user_table"
list_users_data = spark.sql(query).collect()
list_users = [i.userId for i in list_users_data]
print(list_users)
This takes around 16 seconds.
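For context, the 16 seconds covers everything from the load through collect(). A rough sketch of how the stages could be timed separately (the perf_counter split and the count() call are illustrative, assuming the same session as above):

import time

t0 = time.perf_counter()
spark_df = spark.read.format('csv').options(header='true', inferSchema='true').load(r'D:\Study\Lopa\TEDAX_RE\MOVIE_RECOMMENDATION\MOVIE_LENS\MONGO_DB\DATA\INPUT_DATA\ratings.csv').cache()
spark_df.createOrReplaceTempView("user_table")
spark_df.count()  # forces the cache (and the full file read) to materialize
t1 = time.perf_counter()

list_users = [i.userId for i in spark.sql("SELECT DISTINCT userId FROM user_table").collect()]
t2 = time.perf_counter()

print('load + cache:', t1 - t0)
print('distinct + collect:', t2 - t1)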
I expected the PySpark code to take less time than the pandas code.
Am I missing any configuration?
Note: I am running this code on a Windows system with 8 GB RAM and a 4-core CPU.