
I took the UDFs below from the PySpark website, as I am trying to understand whether there is a performance improvement. I used a large range of numbers, but both versions take virtually the same length of time. What am I doing wrong?

Thanks!

import pandas as pd
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType
import time

start = time.time()
# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series(list(range(1, 1000000)))
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64
end = time.time()
print(end-start)

And here is the pandas UDF version:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
import time

start = time.time()
# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series(list(range(1, 1000000)))
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64
kikee1222
  • pandas_udf functions are optimized and faster for grouped operations, like applying a pandas_udf after a groupBy. The grouping allows pandas to perform vectorized operations, and this will be faster than a normal udf. For a simple case like a * b, a normal Spark udf will suffice and be faster. – murtihash May 12 '20 at 19:40
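A minimal sketch of the grouped case this comment describes, assuming a running SparkSession and PySpark 3.x (which provides applyInPandas); the data and column names here are illustrative, not from the question:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Called once per group with a pandas DataFrame, so the arithmetic
    # inside is vectorized pandas rather than a per-row Python call.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()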

1 Answer

Unless your data is large enough that it cannot be processed on a single node, Spark should not be considered.

Pandas performs all of its operations on a single node, while Spark distributes the data across multiple nodes for processing.

So if you compare performance over a small set of data, pandas can outperform Spark.
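Note also that, as posted, both snippets call multiply_func(x, x) directly on a pandas Series, so neither timing ever goes through Spark at all; both measure plain pandas. Below is a minimal sketch of running the comparison through Spark itself, assuming a local SparkSession and PySpark 3.x with pyarrow installed; the column name x and the UDF names are illustrative, and absolute timings will vary:

import time
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 1000000).withColumnRenamed("id", "x")

def multiply_func(a, b):
    return a * b

row_at_a_time = udf(multiply_func, returnType=LongType())

@pandas_udf(LongType())
def vectorized(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

for name, fn in [("udf", row_at_a_time), ("pandas_udf", vectorized)]:
    start = time.time()
    # An aggregation forces the UDF to actually run; select() alone is lazy.
    df.select(fn(col("x"), col("x")).alias("y")).agg({"y": "max"}).collect()
    print(name, time.time() - start)

The difference between the two comes from how data crosses the JVM/Python boundary: the row-at-a-time udf serializes and invokes Python once per row, while the pandas_udf moves data in Arrow batches and operates on whole Series at once.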

Shubham Jain
  • Thanks - I was testing on a small dataset, as the live data I have is 6 PB and I didn't want to use that as my playground. But maybe I need to! – kikee1222 May 12 '20 at 19:49
  • For 6 PB you will definitely have to opt for Spark, and perform a lot of cluster optimization and code optimization. – Shubham Jain May 12 '20 at 19:53
  • Make sure you have configured multiple workers. If you're running it all on a local machine, make sure you have enough CPUs. Yes, this should be obvious, but I can't tell you how much "working" multi-threaded code I have been handed that fails as soon as an extra processor thread is available. – Devon_C_Miller Jun 21 '20 at 12:24
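A minimal sketch of checking that last point on a local machine; the app name is illustrative:

from pyspark.sql import SparkSession

# local[*] gives the driver as many worker threads as the machine has
# cores, while an explicit local[8] pins the parallelism.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("udf-benchmark")
    .getOrCreate()
)
print(spark.sparkContext.defaultParallelism)  # number of threads available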