I want to group_by and filter a dataset to keep the groups that contain any NA or any given factor value. I would like to use the any function within dplyr, but it is slow when testing for NAs or factor values (though not when testing for a numeric value). Example data:
library(tidyverse)
set.seed(10)
df <- data.frame(group = rep(paste("g", seq(1, 50000, 1), sep = ""), each = 500, length.out = 2500000),
                 binary = rbinom(2500000, 1, 0.5),
                 narow = rep(letters[1:26], each = 2, length.out = 2500000),
                 stringsAsFactors = TRUE)  # needed in R >= 4.0 to get the factor columns shown by str() below
df <- df %>%
dplyr::mutate(narow = replace(narow, row_number() == 345 | row_number() == 77777, NA) )
str(df)
#'data.frame': 2500000 obs. of 3 variables:
#$ group : Factor w/ 5000 levels "g1","g10","g100",..: 1 1 1 1 1 1 1 1 1 1 ...
#$ binary: int 1 0 0 1 0 0 0 0 1 0 ...
#$ narow : Factor w/ 26 levels "a","b","c","d",..: 1 1 2 2 3 3 4 4 5 5 ...
Now let's group_by and extract those groups with any binary == 1:
system.time(
dfnew <- df %>%
group_by(group) %>%
filter(any(binary == 1))
)
# user system elapsed
# 0.1 0.0 0.1
This runs quickly, but when I do the same thing to find any NAs it is very slow (and my real dataset is much bigger):
system.time(
dfnew <- df %>%
group_by(group) %>%
filter(any(is.na(narow)))
)
# user system elapsed
# 5.25 8.49 13.75
This seems extremely slow considering how quick the previous, very similar code is (0.1 s vs 13.75 s elapsed). Is this to be expected, or am I doing something wrong? I would like to keep using the any function, as I find it intuitive.
EDIT
It seems to go beyond just NAs. If I filter with any on a factor variable, I get a slow response too:
system.time(
dfnew <- df %>%
group_by(group) %>%
filter(any(narow == "a"))
)
# user system elapsed
# 5.32 7.45 12.83
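For comparison, here is a minimal sketch of two variants that could be timed against the slow calls above. It assumes, without having confirmed it, that the cost comes from evaluating the factor expression separately inside every group. The small data frame and the names small, has_na and na_groups are hypothetical stand-ins, not part of my real data.

```r
library(dplyr)

# Small stand-in for the data above (hypothetical, sized to run instantly)
small <- data.frame(
  group = rep(paste0("g", 1:4), each = 3),
  narow = factor(c("a", "b", NA, "a", "a", "b", "c", NA, "c", "b", "b", "b"))
)

# Variant 1: evaluate is.na() once on the whole column, then use any()
# on the resulting plain logical vector within each group
res_flag <- small %>%
  mutate(has_na = is.na(narow)) %>%
  group_by(group) %>%
  filter(any(has_na)) %>%
  ungroup() %>%
  select(-has_na)

# Variant 2: look up the groups that contain an NA, then keep their rows
na_groups <- unique(as.character(small$group[is.na(small$narow)]))
res_lookup <- small %>%
  filter(as.character(group) %in% na_groups)

# Reference: the original formulation, used here only to check equivalence
res_ref <- small %>%
  group_by(group) %>%
  filter(any(is.na(narow))) %>%
  ungroup()

stopifnot(
  nrow(res_flag) == nrow(res_ref),
  nrow(res_lookup) == nrow(res_ref)
)
```

Variant 1 keeps the any() call, which is the form I would prefer. Wrapping each variant in system.time() on the full df would show whether the per-group factor evaluation is really what is slow.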