I create Spark Dataframe using input text file of size 4GB by using pyspark. then use some condition like:
df.cache() #cache df for fast execution of later instruction
df_pd = df.where(df.column1=='some_value').toPandas() #around 70% of data
Now i am doing all operation on pandas Dataframe df_pd. Now my memory usage become around 13 GB.
- Why, so much memory is consumed?
- How can i do to make my computation faster and efficient? #here df.cache() lead to took 10 minutes for caching.
- I tried to free up pyspark DF memory by using df.unpersist() and sqlContext.clearCache() But it doesn't help.
Note : I am mainly using Pyspark because it efficiently using cpu cores and pandas only use single core of my machine for read file operation.