
I am trying to create a new column ("newaggCol") in a Spark DataFrame using groupBy and sum (with PySpark 1.5). My numeric columns have been cast to either Long or Double. The columns used to form the groupBy are String and Timestamp. My code is as follows:

df= df.withColumn("newaggCol",(df.groupBy([df.strCol,df.tsCol]).sum(df.longCol)))

The traceback for the error points to that line and states:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I feel that I must be calling the functions incorrectly?

1 Answer

It is not possible using SQL aggregations, but you can easily get the desired result using window functions:

import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

# Window spanning every row within each (strCol, tsCol) group;
# -sys.maxsize / sys.maxsize give an unbounded frame that also works on PySpark 1.5.
w = (Window()
    .partitionBy(df.strCol, df.tsCol)
    .rowsBetween(-sys.maxsize, sys.maxsize))

df.withColumn("newaggCol", sum_(df.longCol).over(w))
  • Say I want to get quantiles or percentiles on that 'newaggCol', by a new grouping. So I create "new_window = Window().partitionBy(strcol1, strcol2).orderBy(newaggCol)" and then "df.withColumn("pctl", percentRank().over(new_window))"? – PR102012 Mar 07 '16 at 18:57
  • If you want the same type of results then window should stay how it is. Although keep in mind these are expensive calls. – zero323 Mar 07 '16 at 19:05
  • What if I want to calculate the discrete Percentile or Quantile (i.e. .5, .95) for the numeric col, 'newaggCol'? I know there is a 'percentRank', but that doesn't give dense or consecutive quantiles obviously. Is there a way to use a UDF? I'm not sure how to use those? – PR102012 Mar 09 '16 at 21:50
  • There is a quantile discretizer but it doesn't work by group. You may find this useful: http://stackoverflow.com/q/32827643/1560062 – zero323 Mar 10 '16 at 07:28
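
As a rough sketch of the construction discussed in the comments above, assuming a DataFrame df that already carries the newaggCol computed earlier, and using the commenter's placeholder grouping columns strcol1 and strcol2 (they are not part of the original question):

from pyspark.sql.window import Window
from pyspark.sql.functions import percentRank  # named percent_rank from Spark 1.6 onwards

# Rank rows by newaggCol within each (strcol1, strcol2) group;
# percentRank yields a relative rank in [0, 1], not discrete quantile buckets.
new_window = Window.partitionBy("strcol1", "strcol2").orderBy("newaggCol")
df_with_pctl = df.withColumn("pctl", percentRank().over(new_window))
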