
I create a Spark DataFrame from a 4 GB input text file using PySpark, then filter it with a condition like this:

df.cache()  # cache df so that later actions on it run faster
df_pd = df.where(df.column1 == 'some_value').toPandas()  # around 70% of the data

Now I do all my operations on the pandas DataFrame df_pd, and my memory usage grows to around 13 GB.

  • Why is so much memory consumed?
  • How can I make my computation faster and more efficient? (Here, df.cache() took about 10 minutes.)
  • I tried to free the PySpark DataFrame's memory with df.unpersist() and sqlContext.clearCache(), but it doesn't help.

Note: I am mainly using PySpark because it uses my CPU cores efficiently, whereas pandas uses only a single core for reading the file.
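
For reference, a minimal self-contained version of my setup looks roughly like this (the file path, separator, and column name below are placeholders, not my actual values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-to-pandas").getOrCreate()
df = spark.read.csv("input.txt", sep="\t", header=True)  # placeholder path, delimiter and schema
df_pd = df.where(df.column1 == 'some_value').toPandas()  # collects roughly 70% of the rows onto the driver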

GIRISH kuniyal

2 Answers


Why is so much memory consumed?

I would say it is the duplication of the DataFrame in memory, as you suggested: toPandas() collects the filtered data to the driver, so you end up holding both the cached Spark DataFrame and a full pandas copy at the same time.

How can I make my computation faster and more efficient? (Here, df.cache() took 10 minutes to run.)

df.cache() is only useful if you are going to use this df multiple times. Think of it as a checkpoint: it only pays off when you need to run multiple operations on the same DataFrame. Here it is not necessary, since you perform only one action on it.
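
For instance, here is a sketch of the two patterns, assuming df is the Spark DataFrame from your question (the groupBy aggregation is just a made-up second action); caching only pays off in the second one, where df feeds more than one action:

# Single use: skip the cache and let the filter and collect run in one pass.
df_pd = df.where(df.column1 == 'some_value').toPandas()

# Reuse: cache() is worthwhile only when the same DataFrame backs several actions.
df.cache()
counts = df.groupBy('column1').count().collect()
df_pd = df.where(df.column1 == 'some_value').toPandas()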

I tried to free the PySpark DataFrame's memory with df.unpersist() and sqlContext.clearCache(), but it doesn't help.

unpersist() is the right thing to do. As for sqlContext.clearCache(), I don't know which version of Spark you are using, but you may want to take a look at spark.catalog.clearCache().
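
A minimal sketch of that cleanup on Spark 2.x or later, assuming spark is your SparkSession (note that neither call touches the pandas copy already sitting in df_pd):

df.unpersist()              # release the cached blocks held for this DataFrame
spark.catalog.clearCache()  # drop every cached table/DataFrame in this session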

Although I know this does not directly answer your question, I hope it helps!

Rafaël

What about trying to delete the PySpark DataFrame once you no longer need it?

del df
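
del only removes the Python-side reference to the object; if you want, you can pair it with an explicit garbage-collection pass (a sketch, with gc.collect() as an optional extra on top of the one-liner above):

import gc

del df        # drop the Python reference to the Spark DataFrame
gc.collect()  # ask CPython's garbage collector to reclaim driver-side objects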
Allen Qin