
I have read various Stack Overflow answers to questions about why PySpark may perform worse than pure Python on a local machine. These answers included suggestions that:

  1. the benefits of parallel processing only become apparent with a large number of cores.
  2. the benefits of PySpark do not kick in until the dataset is very large.

Regarding point 1, this Stack Overflow answer (Spark: Inconsistent performance number in scaling number of cores), referencing Amdahl's law, suggests that if the problem is highly parallelizable, then my computer with 8 cores should be able to provide a significant latency improvement, provided I choose the right problem. A problem that should be highly parallelizable is the common FizzBuzz coding question: for a sequence of integers (1, 2, 3, ...), transform each element so that if the number is divisible by 3 it is replaced with "fizz"; if it is divisible by 5 it is replaced with "buzz"; and if it is divisible by both 3 and 5 it is replaced with "fizzbuzz". This should be a very highly parallelizable task.

I've implemented this FizzBuzz function using Python pandas, PySpark DataFrames, and PySpark RDDs, and varied the size of the input list of numbers to compare how performance is affected in each implementation.

Here is the code for each implementation where the length of the numbers list was 100 million:

  • Python pandas implementation (100 million numbers):

     import pandas as pd

     def fizzbuzz(num):
         return "fizzbuzz" if (num % 3 == 0) and (num % 5 == 0) else "fizz" if num % 3 == 0 else "buzz" if num % 5 == 0 else num

     pd.DataFrame(range(1, 100000001))[0].apply(fizzbuzz)
    
  • PySpark RDD implementation (100 million numbers):

     numbers = spark.sparkContext.parallelize(range(1, 100000001))

     def fizzbuzz(num):
         return "fizzbuzz" if (num % 3 == 0) and (num % 5 == 0) else "fizz" if num % 3 == 0 else "buzz" if num % 5 == 0 else str(num)

     fizzBuzzRDD = numbers.map(fizzbuzz)
     answer_list = fizzBuzzRDD.collect()
    
  • PySpark DataFrame UDF implementation (100 million numbers):

     from pyspark.sql import functions
     from pyspark.sql.functions import col
     from pyspark.sql.types import StringType

     def fizzbuzzSpark(num):
         return "fizzbuzz" if (int(num) % 3 == 0) and (int(num) % 5 == 0) else "fizz" if int(num) % 3 == 0 else "buzz" if int(num) % 5 == 0 else str(num)

     fizzbuzz_UDF = functions.udf(fizzbuzzSpark, StringType())
     df_spark = spark.createDataFrame(range(1, 100000001), StringType())
     df_spark = df_spark.withColumn('answer', fizzbuzz_UDF(col("value")))
    

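A note on measurement: each implementation can be timed end to end along these lines (a sketch; the `time_run` helper here is illustrative, not code from the runs above, and each call must end in an action such as `.apply`, `.collect()` or `.count()` so the work is actually executed):

     import time

     def time_run(run):
         """Time one end-to-end run of an implementation (illustrative helper)."""
         start = time.perf_counter()
         run()  # must end in an action so the work really happens
         return time.perf_counter() - start

     # e.g. the RDD implementation at 10 million numbers
     elapsed = time_run(lambda: spark.sparkContext.parallelize(range(1, 10000001))
                                     .map(fizzbuzz)
                                     .collect())
     print(f"elapsed: {elapsed:.2f} sec")
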
The graph below shows how execution time varied as the size of the dataset changed (that is, as the list of numbers got bigger): [graph of execution time vs dataset size][1]

[1]: https://i.stack.imgur.com/9zVoA.png

    Length of numbers list   1K          10K        100K        1M         10M        100M

    pandas                   0.43 msec   2.3 msec   24.3 msec   246 msec   2.59 sec   24.8 sec
    PySpark RDD              31 msec     36 msec    70 msec     292 msec   2.56 sec   43.3 sec
    PySpark DataFrame        13 msec     20 msec    91.2 msec   820 msec   8.12 sec   87 sec

It looks like pandas is consistently outperforming PySpark. Further, the relationship between execution time and dataset size appears to converge to a constant ratio of log(execution time) / log(dataset size), as shown in the graph.
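
As a rough sanity check of that last claim, the scaling exponent can be estimated from the endpoints of the pandas row in the table above (a back-of-the-envelope calculation, not a fitted slope):

    import math

    # Endpoints from the pandas row above: 0.43 msec at 1K numbers, 24.8 sec at 100M numbers.
    t_small, n_small = 0.43e-3, 1_000
    t_large, n_large = 24.8, 100_000_000

    # Slope of log(time) against log(size); a value near 1 means roughly linear scaling.
    slope = math.log(t_large / t_small) / math.log(n_large / n_small)
    print(f"approximate log-log slope for pandas: {slope:.2f}")  # ~0.95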

Does this analysis appear correct? Is there likely to be any massive shift in performance in favor of PySpark as the datasets get even bigger? Would I expect drastically different results if this were running on a much larger cluster?

    just don't use udf. They perform horribly on pyspark dataframes. Same for RDD because you're also calling Python code. Try with pure spark sql code and it should be swift – mck May 20 '21 at 11:56
  • Thanks @mck when you say "pure spark" are you suggesting coding it in Scala? – Maq May 20 '21 at 12:05
  • no, using spark sql (e.g. case when) – mck May 20 '21 at 12:07
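
For context, mck's suggestion would look roughly like the sketch below (not code from the question): the FizzBuzz logic is expressed with the built-in `when`/`otherwise` column expressions, so the work stays inside the JVM instead of round-tripping each row through a Python UDF.

    from pyspark.sql import functions as F

    # Build the numbers column natively (column is named "id") and apply the
    # FizzBuzz rules with Spark SQL expressions instead of a Python UDF.
    df_sql = spark.range(1, 100000001).withColumn(
        "answer",
        F.when((F.col("id") % 15) == 0, "fizzbuzz")
         .when((F.col("id") % 3) == 0, "fizz")
         .when((F.col("id") % 5) == 0, "buzz")
         .otherwise(F.col("id").cast("string")),
    )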

0 Answers