
I'm running n_distinct on a large file (>30GB) and it doesn't appear to produce an exact result.

I have another reference point for the data, and the output of the disk.frame aggregate is off.

The docs mention that n_distinct is an exact calculation, not an estimate.

Is that right?
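
For context, the reference count can be reproduced outside disk.frame along these lines (a sketch only; the file name, column name, and the assumption that a single column fits in memory are illustrative):

library(data.table)

# Illustrative file and column names; reading just one column keeps memory down
ref <- fread("big_file.csv", select = "id")
uniqueN(ref$id)   # exact distinct count to compare against the disk.frame aggregate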

Cauder
  • In the rather terse help page it mentions that `n_unique` is a faster version of `length(unique(x))`. – Rui Barradas Sep 12 '20 at 17:32
  • I'm not familiar with disk.frame, is it possible that you're computing `n_distinct` for each chunk, so that if a value appears in different chunks it's counted several times? – Alexlok Sep 12 '20 at 21:00
  • My understanding is that it distincts each chunk and then distincts the full list – Cauder Sep 12 '20 at 21:10

1 Answer


The implementation of n_distinct can be found at https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R:

# Chunk stage: run within each chunk; returns only that chunk's unique values
#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

# Collected stage: combines the per-chunk uniques and counts distinct values once
#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

Now, it looks to be an exact calculation, as I intended. The logic is simple: it computes the unique values within each chunk, then applies n_distinct to the combined results of all chunks once they are collected.
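
To see why this is exact, here is the same two-stage logic spelled out in plain R on a toy list of chunks (no disk.frame involved; the chunk values are made up for illustration):

# Toy "chunks" with the same values repeated across chunks
chunks <- list(c(1, 2, 2, 3), c(3, 4), c(4, 5, 5))

# Chunk stage: keep only each chunk's unique values
per_chunk <- lapply(chunks, unique)

# Collected stage: count distinct values over the combined per-chunk results
dplyr::n_distinct(unlist(per_chunk))   # 5

# Same answer as one pass over the full data
dplyr::n_distinct(unlist(chunks))      # 5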

But I can't rule out a bug elsewhere.

Do you have a test case showing that the result is not exact? Perhaps you could contribute a PR with a test?
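
Something along these lines could be a starting point for such a test. It is only a sketch: it assumes the usual as.disk.frame()/group_by()/summarize()/collect() workflow, and the data and chunk count are made up so that the same values land in several chunks:

library(disk.frame)
library(dplyr)
setup_disk.frame(workers = 2)

set.seed(1)
df <- data.frame(
  grp = sample(letters[1:3], 1e5, replace = TRUE),
  val = sample(1:500, 1e5, replace = TRUE)   # values repeat across chunks
)

# Spread the data over several chunks so duplicates cross chunk boundaries
dff <- as.disk.frame(df, nchunks = 8, overwrite = TRUE)

got <- dff %>%
  group_by(grp) %>%
  summarize(nd = n_distinct(val)) %>%
  collect() %>%
  arrange(grp)

want <- df %>%
  group_by(grp) %>%
  summarize(nd = n_distinct(val)) %>%
  arrange(grp)

all.equal(got$nd, want$nd)   # should be TRUE if n_distinct is exact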

xiaodai