I have read various Stack Overflow answers to questions about why PySpark may perform worse than pure Python on a local machine. These answers included suggestions that:
- the benefits of parallel processing only become apparent with a large number of cores;
- the benefits of PySpark do not kick in until the dataset is very large.
Regarding point 1, this Stack Overflow answer (Spark: Inconsistent performance number in scaling number of cores), which references Amdahl's law, suggests that if the problem is highly parallelizable, then my 8-core machine should deliver a significant latency improvement, provided I choose the right problem. One problem that should be highly parallelizable is the classic FizzBuzz coding question: for a sequence of integers (1, 2, 3, ...), transform each number so that if it is divisible by 3 it is replaced with "fizz"; if it is divisible by 5, with "buzz"; and if it is divisible by both 3 and 5, with "fizzbuzz". Since each number can be transformed independently, this task should be embarrassingly parallel.
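For reference, Amdahl's law puts a ceiling on the achievable speedup; a quick sketch of the formula (my own numbers, not from the linked answer):

```python
# Amdahl's law: with parallel fraction p of the work and n cores,
# the theoretical speedup is 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# A fully parallel workload (p = 1.0) caps out at an n-fold speedup on n cores,
# while a 95%-parallel one reaches only about 5.9x on 8 cores.
```

So if FizzBuzz really is close to fully parallel, 8 cores should in principle approach an 8x improvement, which is why it seemed like a reasonable test case.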
I've implemented this FizzBuzz function using pandas, PySpark DataFrames, and PySpark RDDs, varying the size of the input list of numbers to compare how each implementation's performance is affected.
Here is the code for each implementation, where the length of the numbers list was 100 million:
Python Pandas Implementation (100 million numbers):
```python
import pandas as pd

def fizzbuzz(num):
    return ("fizzbuzz" if num % 3 == 0 and num % 5 == 0
            else "fizz" if num % 3 == 0
            else "buzz" if num % 5 == 0
            else num)

pd.DataFrame(range(1, 100000001))[0].apply(lambda x: fizzbuzz(x))
```
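(As an aside, not one of the timed implementations: `apply` calls a Python function per row, so a vectorized variant with `numpy.select`, which evaluates the divisibility tests on whole arrays at once, would likely be faster still. A sketch of what I mean, on a small range for illustration:)

```python
import numpy as np

# Small range for illustration; the benchmark used 1..100000000.
nums = np.arange(1, 101)
answer = np.select(
    [nums % 15 == 0, nums % 3 == 0, nums % 5 == 0],  # first matching condition wins
    ["fizzbuzz", "fizz", "buzz"],
    default=nums.astype(str),
)
```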
PySpark RDD Implementation (100 million numbers):
```python
numbers = spark.sparkContext.parallelize(range(1, 100000001))

def fizzbuzz(num):
    return ("fizzbuzz" if int(num) % 3 == 0 and int(num) % 5 == 0
            else "fizz" if int(num) % 3 == 0
            else "buzz" if int(num) % 5 == 0
            else str(num))

fizzBuzzRDD = numbers.map(lambda x: fizzbuzz(x))
answer_list = fizzBuzzRDD.collect()
```
PySpark DataFrame UDF Implementation (100 million numbers):
```python
from pyspark.sql import functions
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

def fizzbuzzSpark(num):
    return ("fizzbuzz" if int(num) % 3 == 0 and int(num) % 5 == 0
            else "fizz" if int(num) % 3 == 0
            else "buzz" if int(num) % 5 == 0
            else str(num))

fizzbuzz_UDF = functions.udf(fizzbuzzSpark, StringType())

df_spark = spark.createDataFrame(range(1, 100000001), StringType())
df_spark = df_spark.withColumn('answer', fizzbuzz_UDF(col("value")))
```
The graph below shows how execution time varied as the size of the dataset changed (that is, as the list of numbers got bigger). [link to graph of execution time vs dataset size][1] [1]: https://i.stack.imgur.com/9zVoA.png
| Length of numbers list | 1K | 10K | 100K | 1M | 10M | 100M |
|---|---|---|---|---|---|---|
| Pandas | 0.43 msec | 2.3 msec | 24.3 msec | 246 msec | 2.59 sec | 24.8 sec |
| PySpark RDD | 31 msec | 36 msec | 70 msec | 292 msec | 2.56 sec | 43.3 sec |
| PySpark DataFrame | 13 msec | 20 msec | 91.2 msec | 820 msec | 8.12 sec | 87 sec |
It looks like pandas consistently outperforms PySpark. Further, the slope of log(execution time) against log(dataset size) appears to converge to a constant, as shown in the graph.
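To quantify that converging slope, here is a quick check (my own calculation) of the log-log slope between successive dataset sizes, using the pandas timings reported in the table above; a slope near 1 means execution time grows roughly linearly with dataset size:

```python
import math

# Dataset sizes and pandas timings (in seconds) from the table above.
sizes = [1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
pandas_secs = [0.43e-3, 2.3e-3, 24.3e-3, 246e-3, 2.59, 24.8]

# Slope of log(time) vs log(size) between successive measurements.
slopes = [
    (math.log(pandas_secs[i + 1]) - math.log(pandas_secs[i]))
    / (math.log(sizes[i + 1]) - math.log(sizes[i]))
    for i in range(len(sizes) - 1)
]
```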
Does this analysis appear correct? Is there likely to be any massive shift in performance in favor of PySpark as the datasets get even bigger? Would I expect drastically different results if this was running on a much larger cluster?