
I have a Spark DataFrame which I can convert to a pandas DataFrame using the toPandas() method available in PySpark.

I have the following questions about this:

  1. Does this conversion defeat the purpose of using Spark itself (distributed computing)?
  2. The dataset is going to be huge, so what about speed and memory issues?
  3. If somebody could also explain what exactly happens with this one line of code, that would really help.

Thanks

function

1 Answer


Yes. Once toPandas() is called on a Spark DataFrame, the data leaves the distributed system: every row is collected to the driver, and the new pandas DataFrame lives entirely in the memory of the cluster's driver node.

So if the Spark DataFrame is huge and does not fit into driver memory, the driver will run out of memory and the job will crash.
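
One way to guard against that is to check the size in Spark first and only convert when the result is small. The helper below is just a sketch: the name to_pandas_capped and the max_rows threshold are made up for illustration (they are not part of PySpark), and a row count is only a rough proxy for memory use, since wide rows cost more than narrow ones. It relies only on the count() and toPandas() methods of a PySpark DataFrame.

```python
def to_pandas_capped(sdf, max_rows=100_000):
    """Convert a Spark DataFrame to pandas only if it is small enough.

    `sdf` is expected to be a pyspark.sql.DataFrame. `max_rows` is a
    hypothetical safety threshold chosen by the caller, not a Spark setting.
    """
    # count() runs as a distributed job; only one number comes back to the driver.
    n_rows = sdf.count()
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows exceeds the cap of {max_rows}; "
            "filter, aggregate, or sample in Spark before calling toPandas()"
        )
    # toPandas() pulls every row to the driver and builds a single local
    # pandas DataFrame there. Nothing after this point is distributed.
    return sdf.toPandas()
```

The usual pattern is to do the heavy filtering and aggregation in Spark and pass only the reduced result to a helper like this, so the pandas side only ever sees data that comfortably fits on one machine.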

WoodChopper