
I need the resulting data frame in the line below to have an alias name "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.

 grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")

4 Answers


You can use agg instead of calling the max method:

from pyspark.sql.functions import max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly, in Scala:

import org.apache.spark.sql.functions.max

joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or

joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
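
For reference, here is a minimal, self-contained PySpark sketch of the agg approach; the SparkSession setup and the toy data are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for joined_df
joined_df = spark.createDataFrame(
    [("2017-01-01", 5), ("2017-01-01", 9), ("2017-01-02", 3)],
    ["datestamp", "diff"],
)

grpdf = joined_df.groupBy("datestamp").agg(F.max("diff").alias("maxDiff"))
print(grpdf.columns)  # ['datestamp', 'maxDiff']
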
  • I prefer this to the accepted answer because it doesn't require one to know in advance the name that Spark will give to the aggregated column. – abeboparebop Apr 14 '17 at 10:58
  • You do have to be careful, as the following syntax does not work: `rmse = df.groupBy('name').agg({'sqerr':'mean'}).alias('MSE')`. The alias command just seems to be ignored. But this works: `df.groupBy('name').agg(F.mean(df.sqerr).alias('MSE'))` (see the sketch below). – user1978816 Jan 29 '20 at 13:30
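
To illustrate the pitfall from the last comment: in the dict form of agg, the trailing .alias() lands on the DataFrame rather than on a Column, so the generated column name survives. A sketch with made-up data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical data
df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 2.0)], ["name", "sqerr"])

# dict form: .alias() aliases the whole DataFrame, so the column keeps its generated name
print(df.groupBy("name").agg({"sqerr": "mean"}).alias("MSE").columns)  # ['name', 'avg(sqerr)']

# Column form: the alias is applied to the Column and sticks
print(df.groupBy("name").agg(F.mean("sqerr").alias("MSE")).columns)   # ['name', 'MSE']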

This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:

import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))
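
The func.col("max(diff)") lookup works because groupBy(...).max(...) names the result column after the expression itself. A quick sketch (toy data assumed) to confirm the generated name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for joined_df
joined_df = spark.createDataFrame(
    [("2017-01-01", 5), ("2017-01-01", 9)], ["datestamp", "diff"]
)

print(joined_df.groupBy("datestamp").max("diff").columns)
# ['datestamp', 'max(diff)']  <- the literal name that func.col() must match
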

In addition to the answers already here, the following are also convenient options if you know the name of the aggregated column and don't want to import from pyspark.sql.functions:

1

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .selectExpr('`max(diff)` AS maxDiff')

Note the backticks: the aggregated column is literally named max(diff), so without them Spark would parse the expression as a fresh max aggregate over a nonexistent diff column (the failure described in the comment below). See docs for info on .selectExpr()

2

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .withColumnRenamed('max(diff)', 'maxDiff')

See docs for info on .withColumnRenamed()

This answer goes into more detail: https://stackoverflow.com/a/34077809
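
For completeness, a small sketch (data made up) exercising option 2. Note that withColumnRenamed silently does nothing if the old name doesn't match exactly, so it is worth checking .columns when in doubt:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for joined_df
joined_df = spark.createDataFrame(
    [("2017-01-01", 5), ("2017-01-02", 3)], ["datestamp", "diff"]
)

grouped_df = joined_df.groupBy("datestamp") \
                      .max("diff") \
                      .withColumnRenamed("max(diff)", "maxDiff")
print(grouped_df.columns)  # ['datestamp', 'maxDiff']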

  • **Did you try the first code block?** It seems wrong: Spark cannot resolve the column name that way. It tries to find a **diff** column and run the **max** function on it, but grouped_df doesn't contain any **diff** column; it contains **temp1.datestamp** and **max(diff)**. – serkan kucukbay Oct 25 '17 at 13:36

You can use a plain select with col(...).alias(...) (note that the Scala-style col(...) as "name" is not valid Python syntax):

from pyspark.sql.functions import col

grouped_df = grpdf.select(col("max(diff)").alias("maxdiff"), col("sum(DIFF)").alias("sumdiff"))
grouped_df.show()
  • This syntax is invalid in pyspark. Testing with version 3.1.2 produces an error: `my_df = spark.createDataFrame([(0, 1, 2, 3), (4, 5, 6, 7)], schema=list('abcd')); my_df.show(); my_df.select(sf.col('a') as ('new_a'))` fails with `SyntaxError: invalid syntax`. – Gerard G Jul 04 '22 at 06:03