PySpark 2.1.1 groupby + approx_count_distinct giving counts of 0

Question

I'm using Spark 2.1.1 (pyspark), doing a groupby followed by an approx_count_distinct aggregation on a DataFrame with about 1.4 billion rows. The groupby operation results in about 6 million groups to perform the approx_count_distinct operation on. The expected distinct counts for the groups range from single-digits to the millions.

Here is the code snippet I'm using, with column 'item_id' containing the ID of items, and 'user_id' containing the ID of users. I want to count the distinct users associated with each item.

>>> distinct_counts_df = data_df.groupby(['item_id']).agg(approx_count_distinct(data_df.user_id).alias('distinct_count'))

In the resulting DataFrame, I'm getting about 16,000 items with a count of 0:

>>> distinct_counts_df.filter(distinct_counts_df.distinct_count == 0).count()
16032

When I checked the actual distinct count for a few of these items, I got numbers between 20 and 60. Is this a known issue with the accuracy of the HLL approximate counting algorithm or is this a bug?

mayank agrawal · Answer 1 · 2017-10-05T15:19:55.343

0

Although I am not sure where the actual problem lies, but since approx_count_distinct relies on approximation(https://stackoverflow.com/a/40889920/7045987), HLL may well be the issue.

You can try this:

There is a parameter 'rsd' which you can pass in approx_count_distinct which determines the error margin. If rsd = 0, it will give you accurate results although the time increases significantly and in that case, countDistinct becomes a better option. Nevertheless, you can try decreasing rsd to say 0.008 at the cost of increasing time. This may help in giving a little more accurate results.

edited Oct 05 '17 at 15:19

answered Oct 05 '17 at 15:13

mayank agrawal

2,495
2
13
32

Thanks for the suggestion. I tried with rsd=0.008. It took about 10x as long to run, but I'm getting the same number of 0's in the results so it didn't fix the issue. – mmcc Oct 10 '17 at 12:07
Then I guess results will remain approximate to this extent. approx_count_distinct will not give to the point correct results in almost most cases. See this https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html – mayank agrawal Oct 10 '17 at 14:07
@mayankagrawal that link wasn't working for me. But [this link works](https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html?utm_source=twitterfeed&utm_medium=twitter) – Wassadamo Apr 09 '22 at 01:57

PySpark 2.1.1 groupby + approx_count_distinct giving counts of 0

1 Answers1

Linked