pyspark - how to find sum of rows filtered

Question

I have a dataset with two columns: Country, Adclicks. How do I find the country with the most ad clicks?

Country | Ad Click
USA       1
USA       0
USA       1
PR        0
PR        0
PR        1

1) `groupBy` the country + `sum` 2) `agg` w/ `max` on the clicks — OneCricketeer, Nov 30 '21 at 20:20

score 0 · Accepted Answer · answered Nov 30 '21 at 20:52

Assuming your DataFrame if defined as the variable "df" then something like this:

import pyspark.sql.functions as psf

# Get aggregate sum
s = df.groupby("Country").agg({'Ad Click': 'sum'})

# Get and display top country
s.registerTempTable("sums_table")
query = """
    SELECT Country
    FROM sums_table
    WHERE `sum(Ad Click)` = (
        SELECT MAX(`sum(Ad Click)`)
        FROM sums_table)
"""
top_country = spark.sql(query).collect()
print(top_country[0]["Country"])

pyspark - how to find sum of rows filtered

1 Answers1