Spark DataFrame schema:

In [177]: testtbl.printSchema()
root
 |-- Date: long (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: double (nullable = true)

I wish to apply a scalar-valued function to a column of testtbl. Suppose I wish to calculate an average of the 'Close' column. For an RDD I would do something like

rdd.fold(0, lambda x, y: x + y)

But testtbl.Close is not an RDD; it is a Column object with limited functionality. The rows of testtbl are accessible through an RDD, but the columns are not. So how do I apply a sum, or a user-defined function, to a single column?
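
For concreteness, the full average I have in mind would look something like this on the RDD side (a sketch, assuming the column can first be pulled out into an RDD of plain numbers):

# Extract the column as an RDD of numbers, then fold and divide by the count.
close_rdd = testtbl.rdd.map(lambda row: row.Close)
total = close_rdd.fold(0, lambda x, y: x + y)
average = total / close_rdd.count()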

Charles Pehlivanian

1 Answer


If you want to apply a function to an entire column, you just have to apply an aggregation operation to that column.

For instance, imagine that you want to compute the sum of all values in the column values. Even though df does not hold aggregated data, it is valid to apply aggregate functions directly to a DataFrame.

from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

# Build a one-column DataFrame and aggregate it without any grouping.
df = sc.parallelize([(1,), (2,), (3,)]).toDF(["values"])
df.agg(sum_("values").alias("sum")).show()

+---+
|sum|
+---+
|  6|
+---+

You can find another example in PySpark's aggregation documentation.
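
Applied to the table in your question, the same pattern gives the average you asked for (a sketch, assuming testtbl is the DataFrame whose schema you posted):

from pyspark.sql.functions import avg

# avg is an aggregate function, so agg() collapses the whole 'Close'
# column into a single-row DataFrame.
testtbl.agg(avg("Close").alias("avg_close")).show()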

For the second part of your question: you can create a user-defined aggregate function (UDAF), but if I'm right it is currently only available in Scala.
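
As a workaround from Python, you can fall back to the RDD API and apply an ordinary Python function to the collected values (a sketch; my_user_fn is a hypothetical stand-in for your function, and collect() brings every value to the driver, so it will not scale to large columns):

# Hypothetical scalar-valued function over the whole column.
def my_user_fn(values):
    return sum(values) / len(values)

# Pull out the single column through the RDD API; collect()
# materializes all of its values on the driver.
close_values = testtbl.rdd.map(lambda row: row.Close).collect()
result = my_user_fn(close_values)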

Alberto Bonsanto
  • Ok - `agg` is shorthand for `df.groupBy().agg()`, but `agg` only supports 5 or so functions according to the [documentation](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData). I'd like to apply a udf to a column (2nd part of question). – Charles Pehlivanian Jul 25 '16 at 18:45
  • `my_user_fn(testtbl.rdd.map(lambda row: row.Close).collect())` ?, but that is inefficient and doesn't return a df. – Charles Pehlivanian Jul 25 '16 at 19:11
  • Yes - UDFs applying row-wise are standard, I am looking for a UDAF. Looks like it is only available in Scala. Thx. – Charles Pehlivanian Jul 25 '16 at 19:30