
I've recently been trying to convert some pure Python code to PySpark in order to process a large dataset. Using my small test dataset, I noticed that the PySpark version is actually slower than the pure Python + pandas DataFrame version. I read some comments, and that seems to be expected.

So now I have this general question: do we use Spark because it's "faster" (which doesn't seem to be the case when the pandas DataFrame fits into main memory)? Or because it can handle large volumes of data in a distributed fashion that otherwise wouldn't fit into memory?
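For reference, here is a minimal sketch of the kind of comparison I mean (the column names and toy data are just illustrative, not my real workload):

```python
# Compare the same aggregation in pandas and in PySpark running locally.
# The toy data and column names ("group", "value") are made up for illustration.
import time

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Toy dataset small enough to fit comfortably in memory.
pdf = pd.DataFrame({"group": ["a", "b"] * 500_000, "value": range(1_000_000)})

# pandas: single-process, in-memory, very little overhead.
t0 = time.time()
pandas_result = pdf.groupby("group")["value"].mean()
print(f"pandas: {time.time() - t0:.3f}s")

# PySpark: pays for JVM startup, task scheduling, and serialization,
# so on data this small it is typically slower than pandas.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sdf = spark.createDataFrame(pdf)

t0 = time.time()
spark_result = sdf.groupBy("group").agg(F.mean("value")).collect()
print(f"PySpark: {time.time() - t0:.3f}s")

spark.stop()
```

On my small test data the pandas path wins by a wide margin, which is what prompted the question.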

  • https://dorianbg.wordpress.com/2017/08/17/spark-vs-pandas-benchmark-why-you-should-use-spark-only-with-really-big-data/ – BENY Jun 11 '18 at 17:34
  • @Wen Thanks for sharing the article; it provides great insight into the questions I have! – ZEE Jun 11 '18 at 17:58
  • Without being the expert big-data guy: I suppose it's both (although I think more the latter) *and* the additional redundancy/safety/robustness (if some machine breaks for whatever reason, the computation still gets done). Additional remark (personal feeling): one high-memory machine in the cloud is usually much more expensive than many low-memory machines (which together easily exceed that amount of available memory). So if the computations can be somewhat parallelized, this kind of distributed computing can become attractive for the memory reason alone (when swap/SSD is not enough). – sascha Jun 11 '18 at 18:38
  • IMHO: I don't like that duplicate call, especially considering the title (ignoring potential other closing reasons). – sascha Jun 11 '18 at 18:44

0 Answers