Suppose I create the following pandas DataFrame and convert it to a Spark DataFrame:
import numpy as np
import pandas as pd
dt = pd.DataFrame(np.array([[1, 5], [2, 12], [4, 17]]), columns=['a', 'b'])
df = spark.createDataFrame(dt)  # spark is an existing SparkSession
I want to create a third column, c, that is the sum of these two columns. I know of the following two ways to do so.
The withColumn() method in Spark:
df1 = df.withColumn('c', df.a + df.b)
Or using SQL:
df.createOrReplaceTempView('mydf')
df2 = spark.sql('select *, a + b as c from mydf')
While both yield the same results, which method is computationally faster?
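One way I thought to check is to compare the physical plans Spark generates for each (a minimal check; I'm assuming the explain() output is a fair proxy for runtime behavior):
df1.explain()
df2.explain()
If both produce the same physical plan, can I conclude their performance is identical?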
Also, how does the SQL approach compare to a Spark user-defined function (UDF)?
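For reference, this is the kind of UDF I have in mind (a minimal sketch; add_cols is a hypothetical helper, and I'm assuming the columns are integer-typed):
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
add_cols = udf(lambda a, b: a + b, LongType())  # hypothetical UDF wrapping plain Python addition
df3 = df.withColumn('c', add_cols(df.a, df.b))
My understanding is that a Python UDF has to move row data between the JVM and a Python worker, so I'd expect it to be slower than either built-in approach, but I'd like to know by how much and why.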