I'm using Spark 2.1.1 (pyspark), doing a groupby followed by an approx_count_distinct aggregation on a DataFrame with about 1.4 billion rows. The groupby operation results in about 6 million groups to perform the approx_count_distinct operation on. The expected distinct counts for the groups range from single-digits to the millions.
Here is the code snippet I'm using, with column 'item_id' containing the ID of items, and 'user_id' containing the ID of users. I want to count the distinct users associated with each item.
>>> distinct_counts_df = data_df.groupby(['item_id']).agg(approx_count_distinct(data_df.user_id).alias('distinct_count'))
In the resulting DataFrame, I'm getting about 16,000 items with a count of 0:
>>> distinct_counts_df.filter(distinct_counts_df.distinct_count == 0).count()
16032
When I checked the actual distinct count for a few of these items, I got numbers between 20 and 60. Is this a known issue with the accuracy of the HLL approximate counting algorithm or is this a bug?