
A Spark newbie here. I recently started playing around with Spark on my local machine, running on two cores with the command:

pyspark --master local[2]

I have a 393 MB text file with almost a million rows. I wanted to perform some data manipulation operations, so I am using the built-in DataFrame functions of PySpark for simple operations like groupBy, sum, max and stddev.

However, when I perform the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.
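Roughly the kind of thing I am doing (a minimal sketch; the column names `key` and `value` and the file name are placeholders, not my real schema):

    from pyspark.sql import functions as F

    # PySpark: built-in DataFrame aggregations
    # (`spark` is the session the pyspark shell provides)
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    result = (df.groupBy("key")
                .agg(F.sum("value").alias("total"),
                     F.max("value").alias("maximum"),
                     F.stddev("value").alias("std_dev")))
    result.show()

    # pandas: the equivalent aggregation on the same file
    import pandas as pd
    pdf = pd.read_csv("data.csv")
    pd_result = pdf.groupby("key")["value"].agg(["sum", "max", "std"])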

I was wondering what could be a possible reason for this. I have a couple of thoughts.

  1. Do the built-in functions handle serialization/deserialization inefficiently? If so, what are the alternatives?
  2. Is the dataset too small to outrun the overhead cost of the underlying JVM that Spark runs on?

Thanks for looking. Much appreciated.

  • Does it make sense at all to use Apache Spark for such a small data set? Pandas is very fast, but it doesn't scale. You want to use it instead of Spark unless you're hitting a `MemoryError`. – MaxU - stand with Ukraine Feb 15 '18 at 20:17
  • I totally agree. I am currently trying my hand at it. That's why this question. – Raj Feb 15 '18 at 20:33

1 Answer


Because:

  • Spark is a framework for distributed computing: fault tolerance, task scheduling and the ability to scale out across many nodes all carry a significant fixed cost, whether or not you need them.
  • Purely in-memory, in-core processing (pandas) avoids the job scheduling, shuffles and serialization/deserialization between the Python driver and the JVM (via py4j) that PySpark pays for on every action.
  • `local[2]` mode is meant for testing and development, not performance; two cores and 393 MB of data are nowhere near enough to amortize that overhead, and a single node offers no opportunity for distribution anyway.

You can go on like this for a long time...
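If you want to see this directly, time the same aggregation both ways (a rough sketch; the file path and column names are placeholders for your own data):

    import time
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    PATH = "data.csv"  # placeholder path

    # pandas: one in-process, in-memory pass over the data
    start = time.time()
    pdf = pd.read_csv(PATH)
    pdf.groupby("key")["value"].agg(["sum", "max", "std"])
    print("pandas:  %.2f s" % (time.time() - start))

    # PySpark: the same aggregation, plus task scheduling, a shuffle for
    # the groupBy, and Python<->JVM traffic through py4j
    # (session/JVM startup is excluded since the session is created first)
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    start = time.time()
    sdf = spark.read.csv(PATH, header=True, inferSchema=True)
    (sdf.groupBy("key")
        .agg(F.sum("value"), F.max("value"), F.stddev("value"))
        .collect())  # force execution; Spark is lazy until an action runs
    print("PySpark: %.2f s" % (time.time() - start))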
