
Spark newbie here. I tried to do some pandas-style operations on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package in Python). Here's what I did:

1) In Spark:

train_df.filter(train_df.gender == '-unknown-').count()

It takes about 30 seconds to get the result back, while in pure Python it takes about 1 second.

2) In Spark:

sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

Same thing: about 30 seconds in Spark, about 1 second in Python (the pandas equivalents I'm comparing against are sketched below).
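Roughly, the pandas code behind those 1-second timings looks like this (train_users.csv and train_pd are just placeholder names; the data could come from anywhere):

import pandas as pd

# Placeholder path; the actual data source may differ.
train_pd = pd.read_csv("train_users.csv")

# 1) Count the rows where gender is '-unknown-'
print((train_pd["gender"] == "-unknown-").sum())

# 2) Count rows per gender
print(train_pd.groupby("gender").size())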

Several possible reasons my Spark is much slower than pure Python:

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's a 2015 MacBook with 8 GB of RAM.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I've heard lots of people use PySpark just fine.)

Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!

Vicky Zhang
  • Using `pyspark` isn't really going to be the problem: the Spark engine is still written in Scala, and how you interface with it doesn't change the fact that it has a Java backend. The real problem is that your data set/computations are not large enough or significant enough to overcome the coordination overhead and latency that Spark introduces (24 MB of data is still in the realm of local computation). Spark is useful for parallel processing, but you need enough work/computation to 'eat' the overhead it adds. – wkl Jan 06 '16 at 04:15

1 Answer


Python will definitely perform better than PySpark on smaller data sets; you will only see the difference in Spark's favor when you are dealing with larger data sets.

By default, when you run Spark SQL through a SQLContext or HiveContext, it uses 200 shuffle partitions. You need to change that to 10, or whatever value suits your data, using sqlContext.sql("set spark.sql.shuffle.partitions=10"); it will definitely be faster than with the default. A minimal sketch is shown below.
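As a rough illustration only (assuming a Spark 1.x-style SQLContext like the one in the question, and a hypothetical train.json input file, since the question doesn't say how train_df was loaded), it could look like this:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "shuffle-partitions-demo")
sqlContext = SQLContext(sc)

# Drop the shuffle-partition count from the default 200 down to 10;
# for a 24 MB data set, 200 tiny tasks are mostly scheduling overhead.
sqlContext.sql("set spark.sql.shuffle.partitions=10")
# Equivalent, and more robust on some 1.x releases (the value must be a string):
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

# Hypothetical input path and format.
train_df = sqlContext.read.json("train.json")
train_df.registerTempTable("train")

sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

On Spark 2.x+ the equivalent is spark.conf.set("spark.sql.shuffle.partitions", "10") on a SparkSession, which is what one of the comments below uses.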

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

You are right, you will not see much difference at lower volumes. Spark can be slower as well.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

For your data volume it might not help much.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's a 2015 MacBook with 8 GB of RAM.

Again, it does not matter for a 24 MB data set.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I've heard lots of people use PySpark just fine.)

In standalone/local mode there will be a difference: Python has more runtime overhead than Scala. But on a larger cluster with distributed workloads, it need not matter.

Durga Viswanath Gadiraju
  • In Spark 1.5.2, `sqlContext.sql("set spark.sql.shuffle.partitions=10")` in pyspark crashes. `sqlContext.setConf('spark.sql.shuffle.partitions', '10')` works just fine though. Note that the argument must be a string. – Pyrce Apr 12 '16 at 22:04
    "You are right, you will not see much difference at lower volumes" - could this be (roughly) quantified for the OP's case. Would you see an advantage at, say, 0.1 GB, 1GB, 10GB, 100GB, 1TB? If someone could point to a question where that is answered then that would be helpful. – jka.ne Sep 07 '16 at 11:02
  • The OP reports a slowdown of about 30x for 200K rows. It seems I'm having a slowdown of maybe 100-1000x for 100K rows (35 minutes on `local[10]`, which, as expected, is better than `local[2]` or `local[6]`), just for a logistic regression. I've tried several values for `spark.conf.set("spark.sql.shuffle.partitions", X)`. Any ideas what might be the cause in my case? – Kostas Mouratidis May 24 '19 at 13:21
  • Thank you! For unit tests I went from 6 minutes to 1 minute by setting the number of partitions to 1. For tiny datasets this is great. – JBernardo Oct 23 '21 at 02:07