Spark newbie here. I tried running some pandas-style operations on my DataFrame in Spark, and surprisingly they are much slower than in plain Python (i.e. using the pandas package). Here's what I did:
1) In Spark:
train_df.filter(train_df.gender == '-unknown-').count()
It takes about 30 seconds to return a result, whereas the equivalent operation in pandas takes about 1 second.
2) In Spark:
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()
Same thing, takes about 30 sec in Spark, 1 sec in Python.
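For anyone who wants to reproduce the comparison, here is a minimal sketch of a timing helper. It is not exactly how the numbers above were obtained; `timed` and `repeats` are hypothetical names introduced just for illustration. Running each query more than once should help show whether the 30 seconds is a one-time startup cost or is paid on every call.

```python
import time

def timed(fn, repeats=3):
    """Call fn() `repeats` times and return (last_result, list_of_elapsed_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()  # for Spark, fn must trigger an action such as count()
        durations.append(time.perf_counter() - start)
    return result, durations

# Stand-in workload so the sketch runs without Spark or pandas installed.
genders = ["-unknown-", "MALE", "FEMALE", "-unknown-"]
count, seconds = timed(lambda: sum(1 for g in genders if g == "-unknown-"))
print(count, seconds)

# With the real data, the calls would look roughly like this
# (train_df is the Spark DataFrame from above; pdf is an assumed pandas copy of it):
#   timed(lambda: train_df.filter(train_df.gender == '-unknown-').count())
#   timed(lambda: (pdf.gender == '-unknown-').sum())
```

If the first Spark run is slow and later runs are much faster, that would point to fixed overhead rather than the per-row work.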
I can think of several possible reasons why Spark is so much slower than pandas here:
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
2) My Spark is running locally, and I should run it on something like Amazon EC2 instead.
3) Running locally is okay, but my machine just doesn't have enough computing capacity. It's a 2015 MacBook with 8 GB of RAM.
4) Spark is slow because I'm using Python. If I used Scala it would be much faster. (Counterargument: I've heard that lots of people use PySpark just fine.)
Which of these is the most likely explanation? I would love to hear from some Spark experts. Thank you very much!