
I have a PySpark dataframe of 13M rows and I would like to convert it to a pandas dataframe. The dataframe will then be resampled for further analysis at various frequencies, such as 1 sec, 1 min, or 10 min, depending on other parameters.

From the literature [1, 2] I have found that using either of the following lines can speed up the conversion from a PySpark to a pandas dataframe:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") 
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

However, I have been unable to see any improvement in performance during the dataframe conversion. All the columns are strings to ensure they are compatible with PyArrow. In the examples below the time taken is always between 1.4 and 1.5 minutes:

[screenshot: timed toPandas() conversion runs, each between 1.4 and 1.5 minutes]

I have seen an example where the processing time dropped from seconds to milliseconds [3]. I would like to know what I am doing wrong and how to optimize the code further. Thank you.

Tanjil
  • It might depend on the rest of your code; Spark uses lazy evaluation, so depending on the steps before you call "to_pandas()", the bulk of the execution time might be elsewhere. – Yoan B. M.Sc Nov 19 '21 at 14:07
  • Hi, thank you for your reply. But the times shown above are only for the lines included in the screenshot. The times for the remainder of the code have not been included here. – Tanjil Nov 19 '21 at 14:31
  • Is the dataframe you are converting just a Spark 'range' as in the example, or is it reading from a source and performing some transformations? – Yoan B. M.Sc Nov 19 '21 at 14:37
  • What version of spark are you running? – Matt Andruff Nov 19 '21 at 15:15
  • If you are using a notebook, your Spark configuration is already set. You can't change it the way you are trying to; you need to change the launch config to include this setting. It also depends on which version of Spark you are using: in later versions this is already on by default. – Matt Andruff Nov 19 '21 at 15:17
  • Hi Matt, Thank you for your question. I am using Spark 3.1.2 on Azure databricks. – Tanjil Nov 20 '21 at 10:35

2 Answers


Depending on the version of Spark and the notebook you are using, you may not need to change this setting: since Spark 2.3, Spark uses Arrow integration by default. The Spark context, once set, is read-only. (This ~2 minute operation could be your system recognizing you've changed a read-only property, relaunching Spark, and re-running your commands.)
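
A quick way to confirm what the running session actually has, rather than assuming the set() call took effect (a sketch using the Spark 3.x property names):

# Inspect the live session version and Arrow-related config
print(spark.version)
print(spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))
print(spark.conf.get("spark.sql.execution.arrow.pyspark.fallback.enabled"))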

Respectfully, pandas is for small data and Spark is for big data. 13 million rows could be small data, but you're already complaining about performance; maybe you should stick with Spark and use multiple executors/partitions?

You clearly can do this with Pandas, but should you?
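
If the resampling at 1 sec / 1 min / 10 min is the end goal, here is a sketch of how that aggregation could be done directly in Spark without converting (assuming a Spark dataframe sdf with an illustrative timestamp column ts and a numeric column value, cast back from string if needed):

from pyspark.sql import functions as F

# Mean of `value` in 10-minute buckets: the Spark analogue of a pandas resample().mean()
resampled = (
    sdf.groupBy(F.window("ts", "10 minutes"))
       .agg(F.avg("value").alias("mean_value"))
       .orderBy("window")
)
resampled.show(5, truncate=False)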

Matt Andruff
  • Hi Matt, thank you again for providing useful information. Since I am using Spark 3.1.2, PyArrow might already be on by default, as you mentioned. The example above is an exaggerated one; in my real case the actual dataframe is 350K rows and takes 55 seconds. I am using pandas just to get my proof of concept working. Once the POC is working I will need to figure out how to resample PySpark dataframes, so that I can move away from pandas altogether. – Tanjil Nov 20 '21 at 11:32
  • Ok got it. FYI: you can of course take a sample of PySpark rows (it's a built-in function). – Matt Andruff Nov 21 '21 at 16:29
  • Hi Matt, what I meant was finding the PySpark equivalent of the following pandas operation: df["Value"].resample(rule="15min").mean() – Tanjil Nov 22 '21 at 12:56
  • Ah gotcha, sorry, check out the link below when it comes time to do that. https://stackoverflow.com/questions/39271374/pyspark-how-to-resample-frequencies – Matt Andruff Nov 22 '21 at 18:45
  • If you feel this answer was helpful and you are comfortable could you mark it as correct? – Matt Andruff Dec 10 '21 at 20:57

Have you considered using Koalas? (A pandas clone that uses Spark dataframes, so you can do pandas operations on a Spark dataframe.)
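
A minimal sketch of what that looks like (assuming sdf is an existing Spark dataframe and the databricks-koalas package is installed; on Spark 3.2+ the same API ships as pyspark.pandas):

import databricks.koalas  # importing koalas registers to_koalas() on Spark dataframes

# Wrap the Spark dataframe so it exposes a pandas-like API while staying distributed
kdf = sdf.to_koalas()
print(kdf.shape)
print(kdf["value"].mean())  # "value" is an illustrative column name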

Matt Andruff