I have a DataFrame with two columns, "date" and "value". How do I add two new columns, "value_mean" and "value_sd", where "value_mean" is the average of "value" over the last 10 days (including the current day as specified in "date") and "value_sd" is the standard deviation of "value" over the same 10 days?
- If data can be naturally partitioned by some column (for example year or month) you can use the solution described in http://stackoverflow.com/q/33207164/1560062 Otherwise this problem is not a good fit for DataFrames. – zero323 Feb 11 '16 at 19:58
- Thanks! Average (mean) works beautifully over window partitions, but not "stddev". Do you know how to do standard deviation on a WindowSpec? – May Xue Feb 15 '16 at 18:05
- Build your own from basic expressions. See: http://stackoverflow.com/a/31791275/1560062 – zero323 Feb 15 '16 at 22:32
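Following zero323's pointers, the rolling statistics the question asks for can be sketched by combining basic expressions over a WindowSpec. This is a sketch, not the answerer's code: it assumes the modern SparkSession API, converts "date" to a days-since-epoch column so `rangeBetween` can cover the last 10 days, and uses made-up sample data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("rolling").getOrCreate()
import spark.implicits._

// Illustrative data; the real DataFrame has columns "date" and "value".
val df = Seq(
  ("2016-01-01", 1.0),
  ("2016-01-05", 3.0),
  ("2016-01-09", 5.0)
).toDF("date", "value")

// Days since epoch give rangeBetween a numeric ordering key.
val withDay = df.withColumn(
  "day", datediff($"date".cast("date"), lit("1970-01-01").cast("date")))

// Current day plus the 9 preceding days = a 10-day window.
// Note: without partitionBy, all rows go to a single partition (zero323's caveat).
val w = Window.orderBy($"day").rangeBetween(-9, 0)

// Standard deviation built from basic expressions: sqrt(E[x^2] - E[x]^2).
val result = withDay
  .withColumn("value_mean", avg($"value").over(w))
  .withColumn("value_sd",
    sqrt(avg($"value" * $"value").over(w) - pow(avg($"value").over(w), 2)))
```

Unlike a plain `agg`, this attaches the rolling mean and standard deviation to every row, which is what the question asks for.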
1 Answer
Spark SQL provides various DataFrame functions like avg, mean, sum, etc.
You just have to apply them to a DataFrame column as a Spark SQL Column expression:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
Create a private method for the (population) standard deviation:
private def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))
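This helper relies on the identity Var(x) = E[x²] − (E[x])². A quick plain-Scala sanity check of that identity, with made-up numbers:

```scala
// Population standard deviation computed two ways, to confirm the identity
// sqrt(E[x^2] - E[x]^2) used in the stddev helper above.
val xs = Seq(1.0, 3.0, 5.0)
val mean = xs.sum / xs.size
val direct = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
val viaIdentity = math.sqrt(xs.map(x => x * x).sum / xs.size - mean * mean)
// Both evaluate to sqrt(8/3) ≈ 1.633
```

Beware that for nearly constant data the subtraction can lose precision (and round to a tiny negative under the square root), which is one reason built-in stddev implementations use a different formulation.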
Now you can create Column expressions for the average and the standard deviation:
val value_sd: org.apache.spark.sql.Column = stddev(df.col("value")).as("value_sd")
val value_mean: org.apache.spark.sql.Column = avg(df.col("value")).as("value_mean")
Filter your DataFrame for the last 10 days, or whatever range you need:
val filterDF = df.filter("") // put your filter condition here
Now you can apply the aggregate functions to filterDF:
filterDF.agg(value_mean, value_sd).show
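Putting the answer's pieces together as a runnable sketch (the sample data and the `datediff`-based filter are illustrative assumptions, not part of the answer):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("agg").getOrCreate()
import spark.implicits._

// Illustrative data: two rows dated today so they survive the 10-day filter.
val today = java.time.LocalDate.now.toString
val df = Seq((today, 2.0), (today, 4.0)).toDF("date", "value")

// The answer's stddev helper, built from basic expressions.
def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))

val value_sd   = stddev(df.col("value")).as("value_sd")
val value_mean = avg(df.col("value")).as("value_mean")

// One possible filter for "last 10 days", assuming "date" is castable to a date.
val filterDF = df.filter(datediff(current_date(), $"date".cast("date")) < 10)

val result = filterDF.agg(value_mean, value_sd)
```

Note that this yields a single aggregated row for the filtered range, not new per-row columns; attaching rolling statistics to every row requires window functions instead.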

– Gabber