
I'm running n_distinct on a large file (>30GB) and it doesn't appear to produce an exact result.

I have another reference point for the data, and the output of the disk.frame aggregate is off.

The docs mention that n_distinct is an exact calculation, not an estimate.

Is that right?
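
For context, the reference count can be reproduced outside disk.frame along these lines (a sketch only; the file name, column name, and the assumption that a single column fits in memory are illustrative):

library(data.table)

# Illustrative file and column names; reading just one column keeps memory down
ref <- fread("big_file.csv", select = "id")
uniqueN(ref$id)   # exact distinct count to compare against the disk.frame aggregate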

Cauder
  • In the rather terse help page it mentions that `n_unique` is a faster version of `length(unique(x))`. – Rui Barradas Sep 12 '20 at 17:32
  • I'm not familiar with disk.frame, is it possible that you're computing `n_distinct` for each chunk, so that if a value appears in different chunks it's counted several times? – Alexlok Sep 12 '20 at 21:00
  • My understanding is that it distincts each chunk and then distincts the full list – Cauder Sep 12 '20 at 21:10

1 Answer


The implementation of n_distinct can be found at https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R:

# Chunk stage: run within each chunk; returns only that chunk's unique values
#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

# Collected stage: combines the per-chunk uniques and counts distinct values once
#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

Now, it looks to be an exact calculation, as I intended. The logic is simple: it computes the unique values within each chunk, then applies n_distinct to the combined results of all chunks once they are collected.
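
To see why this is exact, here is the same two-stage logic spelled out in plain R on a toy list of chunks (no disk.frame involved; the chunk values are made up for illustration):

# Toy "chunks" with the same values repeated across chunks
chunks <- list(c(1, 2, 2, 3), c(3, 4), c(4, 5, 5))

# Chunk stage: keep only each chunk's unique values
per_chunk <- lapply(chunks, unique)

# Collected stage: count distinct values over the combined per-chunk results
dplyr::n_distinct(unlist(per_chunk))   # 5

# Same answer as one pass over the full data
dplyr::n_distinct(unlist(chunks))      # 5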

But I can't rule out a bug elsewhere.

Do you have a test case showing that the result is not exact? Perhaps you could contribute a PR with a test?
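
Something along these lines could be a starting point for such a test. It is only a sketch: it assumes the usual as.disk.frame()/group_by()/summarize()/collect() workflow, and the data and chunk count are made up so that the same values land in several chunks:

library(disk.frame)
library(dplyr)
setup_disk.frame(workers = 2)

set.seed(1)
df <- data.frame(
  grp = sample(letters[1:3], 1e5, replace = TRUE),
  val = sample(1:500, 1e5, replace = TRUE)   # values repeat across chunks
)

# Spread the data over several chunks so duplicates cross chunk boundaries
dff <- as.disk.frame(df, nchunks = 8, overwrite = TRUE)

got <- dff %>%
  group_by(grp) %>%
  summarize(nd = n_distinct(val)) %>%
  collect() %>%
  arrange(grp)

want <- df %>%
  group_by(grp) %>%
  summarize(nd = n_distinct(val)) %>%
  arrange(grp)

all.equal(got$nd, want$nd)   # should be TRUE if n_distinct is exact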

xiaodai