
What's the syntax for a GROUP BY ... HAVING in Spark without using an SQLContext/HiveContext? I know I can do

DataFrame df = some_df;
df.registerTempTable("df");
DataFrame df1 = sqlContext.sql("SELECT * FROM df GROUP BY col1 HAVING some stuff");

but how do I do it with a syntax like

df.select(df.col("*")).groupBy(df.col("col1")).having("some stuff")

This .having() does not seem to exist.


2 Answers


Yes, it doesn't exist; the DataFrame API has no having method. You express the same logic with agg followed by where:

df.groupBy(someExpr).agg(someAgg).where(somePredicate)
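For instance, here is a minimal PySpark sketch (the DataFrame df and its columns col1 and value are hypothetical) that mirrors SELECT col1, sum(value) AS total FROM df GROUP BY col1 HAVING total > 100:

from pyspark.sql.functions import col, sum as sum_

# groupBy + agg computes the aggregate that HAVING would reference;
# where() applied after agg() then filters on it, just like HAVING
(df.groupBy(col("col1"))
   .agg(sum_("value").alias("total"))
   .where(col("total") > 100))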

Say, for example, I want to count the products in each category whose fee is less than 3200, keeping only the categories whose count is greater than 10:

  • SQL query:
sqlContext.sql("""
    SELECT Category, count(*) AS count
    FROM hadoopexam
    WHERE HadoopExamFee < 3200
    GROUP BY Category
    HAVING count > 10""")
  • DataFrames API (PySpark):
from pyspark.sql.functions import col, count

(df.filter(df.HadoopExamFee < 3200)        # WHERE HadoopExamFee < 3200
   .groupBy('Category')                    # GROUP BY Category
   .agg(count('Category').alias('count'))  # count(*) AS count
   .filter(col('count') > 10))             # HAVING count > 10
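Note that filter and where are aliases in the DataFrame API: a filter before groupBy plays the role of SQL's WHERE clause, while a filter on the aggregated column after agg plays the role of HAVING.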