I took the UDFs below from the PySpark documentation because I am trying to see whether a pandas UDF gives a performance improvement over a plain Python UDF. I used a large range of numbers, but both versions take virtually the same amount of time. What am I doing wrong?
Thanks!
import pandas as pd
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType
import time
start = time.time()
# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = udf(multiply_func, returnType=LongType())
# Run the plain Python function directly on local pandas data
x = pd.Series(list(range(1, 1000000)))
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# ...
# dtype: int64
end = time.time()
print(end-start)
And here is the pandas UDF version:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType
import time
start = time.time()
# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())
# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series(list(range(1, 1000000)))
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# ...
# dtype: int64
end = time.time()
print(end-start)
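In case it is relevant, here is how I understood the UDF is actually meant to be exercised through Spark itself, based on the same docs page (a sketch; it assumes an active SparkSession named spark):
# Create a Spark DataFrame from the local pandas data
# ('spark' is assumed to be an existing SparkSession)
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
# Execute the function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()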