
I’m using Spark 2.2, and I have a Spark DataFrame that contains about 3 million rows and 15 columns. When I apply .toPandas() to my DataFrame, a lot of YARN resources are consumed, about 87 GB of allocated memory.
I was wondering whether this behaviour is normal or not?
Thanks in advance
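
For context, here is a minimal sketch of the conversion step; the table and column names below are placeholders, not my real ones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toPandas-memory").getOrCreate()

# Placeholder source table (the real one starts much larger).
df = spark.table("my_table")

# Placeholder transformations; in my job the result is roughly
# 3 million rows and 15 columns.
filtered = df.filter(df.some_flag == 1)

# toPandas() collects the entire result to the driver, so the driver
# must hold the full deserialized dataset in memory at once.
pdf = filtered.toPandas()
```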

Bryan A
  • Does this answer your question? [collect() or toPandas() on a large DataFrame in pyspark/EMR](https://stackoverflow.com/questions/47536123/collect-or-topandas-on-a-large-dataframe-in-pyspark-emr) – notNull Jul 22 '20 at 14:19
  • I already saw that post, but it doesn’t resolve my problem. In fact, my initial table contains 250 million rows; after applying some transformations, I only have 3 million rows. Then I convert this Spark DataFrame to a pandas DataFrame. However, when I check the YARN UI, my job is consuming 87 GB of allocated memory, and I can’t figure out why: when I save the pandas df to a CSV file on the machine, it’s only 350 MB (see the rough size check below). – Bryan A Jul 22 '20 at 14:29
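
A rough way to check the discrepancy described in the comment above is to compare the pandas DataFrame’s in-memory footprint against its size on disk; `pdf` is the DataFrame returned by `toPandas()` above, and the output path is a placeholder:

```python
import os

# In-memory footprint of the collected pandas DataFrame.
in_memory_mb = pdf.memory_usage(deep=True).sum() / 1e6
print(f"pandas in-memory size: {in_memory_mb:.0f} MB")

# Placeholder output path. CSV on disk is typically much smaller than
# the in-memory representation, and both are far below the YARN figure,
# which also includes executor containers and JVM heap reservations.
pdf.to_csv("/tmp/out.csv", index=False)
print(f"CSV size on disk: {os.path.getsize('/tmp/out.csv') / 1e6:.0f} MB")
```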

0 Answers