What is the difference between a Spark DataFrame and a pandas DataFrame? My understanding is that a pandas DataFrame is primarily useful for reading CSV data into a DataFrame, whereas a Spark DataFrame is used to load an RDD (Resilient Distributed Dataset) into a DataFrame that can then be manipulated. Please share your feedback.

user4157124
vinSan
  • 2
    The main difference is that the Spark DF is a distributed object, whereas the pandas DF is centralized on a single machine. – Soheil Pourbafrani Sep 01 '20 at 23:47
  • 2
    Does this answer your question? [How Spark Dataframe is better than Pandas Dataframe in performance?](https://stackoverflow.com/questions/55912334/how-spark-dataframe-is-better-than-pandas-dataframe-in-performance) – kavetiraviteja Sep 02 '20 at 05:08

1 Answer

A pandas DataFrame is held in memory on a single server, so its size is limited by that server's memory, and all transformations and processing run on that one machine. In short, you are not using distributed computing, with the combined compute and memory of multiple servers in a cluster.

A Spark DataFrame is distributed across a Spark cluster, so its size is limited by the size of the cluster rather than by a single machine. The cluster can be scaled up or down as needed, and the DataFrame comes with the support of the Spark framework.
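To make the contrast concrete, here is a minimal sketch that runs the same group-by average both ways. The function names, the sample rows, the app name, and the `local[*]` master are illustrative assumptions, not part of the question. The Spark half also assumes `pyspark` and a Java runtime are installed, so it is imported lazily and only runs if you call it.

```python
import pandas as pd


def pandas_mean_by_key(rows):
    """Aggregate entirely in the memory of the local Python process."""
    pdf = pd.DataFrame(rows, columns=["key", "value"])
    return pdf.groupby("key")["value"].mean().to_dict()


def spark_mean_by_key(rows):
    """Same aggregation, but planned by Spark and run over partitions.

    pyspark is imported here so the pandas half works without it.
    Assumes pyspark and a Java runtime are available.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # local[*] uses this machine's cores; on a real cluster the master
    # would point at the cluster manager instead (illustrative choice).
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("df-comparison")
        .getOrCreate()
    )
    try:
        sdf = spark.createDataFrame(rows, schema=["key", "value"])
        # Transformations are lazy; work starts only at collect().
        result = sdf.groupBy("key").agg(F.mean("value").alias("mean_value"))
        return {r["key"]: r["mean_value"] for r in result.collect()}
    finally:
        spark.stop()


rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
local_result = pandas_mean_by_key(rows)
print(local_result)
```

Calling `spark_mean_by_key(rows)` should give the same averages, but the data is split into partitions and the work is scheduled by Spark, which is what lets the same code scale past one machine's memory. For data this small, pandas is usually the simpler and faster choice.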

vaquar khan