
I create a Spark DataFrame from a 4 GB input text file using PySpark, then filter it with a condition like this:

df.cache()  # cache df so that later actions on it run faster
df_pd = df.where(df.column1 == 'some_value').toPandas()  # around 70% of the data

Now I do all my operations on the pandas DataFrame df_pd, and my memory usage grows to around 13 GB.

  • Why is so much memory consumed?
  • How can I make my computation faster and more efficient? (Here, df.cache() took about 10 minutes.)
  • I tried to free the PySpark DataFrame's memory with df.unpersist() and sqlContext.clearCache(), but it doesn't help.

Note: I am mainly using PySpark because it uses my CPU cores efficiently, whereas pandas uses only a single core for reading the file.
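
For reference, a minimal self-contained version of my setup looks roughly like this (the file path, separator, and column name below are placeholders, not my actual values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-to-pandas").getOrCreate()
df = spark.read.csv("input.txt", sep="\t", header=True)  # placeholder path, delimiter and schema
df_pd = df.where(df.column1 == 'some_value').toPandas()  # collects roughly 70% of the rows onto the driver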

GIRISH kuniyal

2 Answers


Why is so much memory consumed?

I would say it is the duplication of the DataFrame in memory, as you suggested: toPandas() collects the filtered data to the driver, so you end up holding both the cached Spark DataFrame and a full pandas copy at the same time.

How can I make my computation faster and more efficient? (Here, df.cache() took 10 minutes to run.)

df.cache() is only useful if you are going to use this df multiple times. Think of it as a checkpoint: it only pays off when you need to run multiple operations on the same DataFrame. Here it is not necessary, since you perform only one action on it.
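
For instance, here is a sketch of the two patterns, assuming df is the Spark DataFrame from your question (the groupBy aggregation is just a made-up second action); caching only pays off in the second one, where df feeds more than one action:

# Single use: skip the cache and let the filter and collect run in one pass.
df_pd = df.where(df.column1 == 'some_value').toPandas()

# Reuse: cache() is worthwhile only when the same DataFrame backs several actions.
df.cache()
counts = df.groupBy('column1').count().collect()
df_pd = df.where(df.column1 == 'some_value').toPandas()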

I tried to free the PySpark DataFrame's memory with df.unpersist() and sqlContext.clearCache(), but it doesn't help.

unpersist() is the right thing to do. As for sqlContext.clearCache(), I don't know which version of Spark you are using, but you may want to take a look at spark.catalog.clearCache().
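
A minimal sketch of that cleanup on Spark 2.x or later, assuming spark is your SparkSession (note that neither call touches the pandas copy already sitting in df_pd):

df.unpersist()              # release the cached blocks held for this DataFrame
spark.catalog.clearCache()  # drop every cached table/DataFrame in this session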

Although I know this does not directly answer your question, I hope it helps!

Rafaël

What about trying to delete the PySpark DataFrame once you no longer need it?

del df
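
del only removes the Python-side reference to the object; if you want, you can pair it with an explicit garbage-collection pass (a sketch, with gc.collect() as an optional extra on top of the one-liner above):

import gc

del df        # drop the Python reference to the Spark DataFrame
gc.collect()  # ask CPython's garbage collector to reclaim driver-side objects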
Allen Qin