The implementation of n_distinct can be found here: https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R
#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if (na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}
#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}
As far as I can tell, this is an exact calculation, as I intended. The logic is simple: it computes the unique values within each chunk, and then calls n_distinct on the combined results of all chunks once they are collected.
But I can't rule out a bug elsewhere.
Do you have test cases showing that the result is not exact? Perhaps you could contribute a PR with a test?
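As a starting point for such a test, here is a minimal sketch in base R only. It does not exercise disk.frame itself: chunk_stage and collect_stage are hypothetical stand-ins that mirror the two functions quoted above, with length(unique(...)) used in place of dplyr::n_distinct so the snippet has no dependencies. The chunking via split() is also just an assumption about how the data might be partitioned.

```r
# Hypothetical stand-ins for the two stages quoted above (not the disk.frame API)
chunk_stage <- function(x, na.rm = FALSE) {
  if (na.rm) setdiff(unique(x), NA) else unique(x)
}
collect_stage <- function(listx) {
  # length(unique(...)) stands in for dplyr::n_distinct
  length(unique(unlist(listx)))
}

set.seed(1)
x <- sample(c(1:50, NA), size = 1000, replace = TRUE)

# Partition the vector into 10 chunks of 100 values each
chunks <- split(x, rep(1:10, each = 100))

# Two-stage result versus a direct count, keeping NA as a value
two_stage <- collect_stage(lapply(chunks, chunk_stage))
direct <- length(unique(x))

# Same comparison with NAs removed at the chunk stage
two_stage_narm <- collect_stage(lapply(chunks, chunk_stage, na.rm = TRUE))
direct_narm <- length(unique(x[!is.na(x)]))

stopifnot(two_stage == direct, two_stage_narm == direct_narm)
```

If a check like this passes on your data but the disk.frame result still differs from the in-memory one, the discrepancy is probably somewhere other than these two functions, and a small reproducible example would help narrow it down.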