I have a Spark DataFrame (`sdf`) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this DataFrame, and the most straightforward solution is `sdf.groupBy("ip", "url").count()`. However, since the DataFrame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing `.count()` with `.approx_count_distinct()`, which was syntactically incorrect.
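To make that concrete, here's a minimal sketch of what I have so far; the commented-out line is my failed substitution attempt (column names `ip` and `url` as above):

```python
# Exact per-pair counts: correct, but slow when sdf has billions of rows
exact_counts = sdf.groupBy("ip", "url").count()

# My naive substitution attempt -- this errors out, since the grouped object
# returned by groupBy() doesn't expose an approx_count_distinct() method:
# approx_counts = sdf.groupBy("ip", "url").approx_count_distinct()
```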
I searched "how to use .approx_count_distinct()
with groupBy()
" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))
) doesn't seem to give me the counts that I want. The method .approx_count_distinct()
can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count"))
, either.
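For completeness, this is the version I put together from that answer, along with the two-column call that fails:

```python
from pyspark.sql import functions as F

# Adapted from the linked answer: this approximates the number of distinct
# url values *within* each (ip, url) group, which is not the count I'm after
per_group = (
    sdf.groupBy("ip", "url")
       .agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))
)

# approx_count_distinct() accepts a single column (plus an optional rsd
# precision argument), so passing two columns raises an error:
# totals = sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count"))
```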
My question is: is there a way to get `.approx_count_distinct()` to work on multiple columns and count distinct combinations of those columns? If not, is there another function that can do just that, and what would an example usage look like?
Thank you so much in advance for your help!