I've recently been trying to convert some pure Python code to PySpark in order to process a large dataset. Using my small test dataset, I noticed that the PySpark version is actually slower than the pure Python + pandas version. From the comments I've read, that seems to be expected.
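For context, the code I'm comparing is essentially a pandas aggregation rewritten with the PySpark DataFrame API. A minimal sketch of the kind of comparison I mean (toy data and made-up column names, not my actual workload) would look something like this:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# pandas version: everything lives in memory on a single machine
pdf = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
pandas_result = pdf.groupby("key")["value"].sum()

# PySpark version: the same aggregation, but routed through Spark's engine
spark = SparkSession.builder.master("local[*]").appName("toy-comparison").getOrCreate()
sdf = spark.createDataFrame(pdf)
spark_result = sdf.groupBy("key").agg(F.sum("value").alias("value")).toPandas()

spark.stop()
```

On a dataset this small, the pandas line finishes almost instantly, while the PySpark version pays the overhead of starting a session, planning the job, and shuffling data between the JVM and Python.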
So now I have a general question: do we use Spark because it's "faster" (which doesn't seem to be the case when the pandas DataFrame fits into main memory), or because it can handle large volumes of data in a distributed fashion that otherwise wouldn't fit into memory?