
I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this DataFrame, and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the DataFrame has billions of rows, exact counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), but that was syntactically incorrect.
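For reference, here is roughly what the exact version looks like (a minimal sketch, assuming sdf is already loaded):

# Exact per-pair row counts -- correct, but slow on billions of rows
pair_counts = sdf.groupBy("ip", "url").count()
pair_counts.show()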

I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along these lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't give me the counts that I want. Also, .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")) either.

My question is: is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of those columns? If not, is there another function that can do just that, and what would an example usage look like?

Thank you so much in advance for your help!

data_bear

2 Answers


Group with expressions and alias as needed. Let's try:

from pyspark.sql.functions import expr

# Approximate distinct counts of ip and url within each (ip, url) group
df.groupBy("ip", "url").agg(
    expr("approx_count_distinct(ip)").alias("ip_count"),
    expr("approx_count_distinct(url)").alias("url_count"),
).show()
wwnde

Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 for every group, since you are counting distinct values of one of the grouping columns, url.
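For instance, with a tiny made-up DataFrame (the data here is purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Two visits from the same IP to the same URL
toy = spark.createDataFrame(
    [("1.2.3.4", "example.com/a"), ("1.2.3.4", "example.com/a")],
    ["ip", "url"],
)

# Within each (ip, url) group, url has exactly one distinct value,
# so distinct_count comes out as 1 for every group
toy.groupBy("ip", "url").agg(
    F.approx_count_distinct("url").alias("distinct_count")
).show()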

If you want to count distinct IP-URL pairs using the approx_count_distinct function, you can combine the two columns into an array and then apply the function to that array. It would be something like this:

sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")
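
If you prefer the DataFrame API over selectExpr, the same idea can be written roughly like this (assuming pyspark.sql.functions is imported as F):

from pyspark.sql import functions as F

# Approximate number of distinct (ip, url) pairs across the whole DataFrame
sdf.agg(
    F.approx_count_distinct(F.array("ip", "url")).alias("distinct_count")
).show()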
AdibP