
I want to group_by and then keep the groups that contain any NA or any particular factor value. I would like to use the any function within dplyr, but I find it slow for NAs and factors (though not for numeric values). Example data:

library(tidyverse)
set.seed(10)
# stringsAsFactors = TRUE is needed on R >= 4.0 to get the factor columns shown by str() below
df <- data.frame(group = rep(paste("g", seq(1, 50000, 1), sep = ""), each = 500, length.out = 2500000),
                 binary = rbinom(2500000, 1, 0.5),
                 narow = rep(letters[1:26], each = 2, length.out = 2500000),
                 stringsAsFactors = TRUE)
df <- df %>%
  dplyr::mutate(narow = replace(narow, row_number() == 345 | row_number() == 77777, NA))

str(df)
# 'data.frame':  2500000 obs. of  3 variables:
# $ group : Factor w/ 5000 levels "g1","g10","g100",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ binary: int  1 0 0 1 0 0 0 0 1 0 ...
# $ narow : Factor w/ 26 levels "a","b","c","d",..: 1 1 2 2 3 3 4 4 5 5 ...

Now let's group_by and extract the groups with any binary == 1:

system.time(
  dfnew <- df %>% 
    group_by(group) %>% 
    filter(any(binary == 1))
)
# user  system elapsed 
# 0.1     0.0     0.1

This runs quickly, but when I do the same thing to find any NAs it is very slow (and my real dataset is much bigger):

system.time(
  dfnew <- df %>% 
    group_by(group) %>% 
    filter(any(is.na(narow)))
  )
# user  system elapsed 
# 5.25    8.49   13.75 

This seems extremely slow given how quick the previous, very similar code was (0.1 s vs 13.75 s). Is this to be expected, or am I doing something wrong? I would like to keep using the any function because I find it intuitive.

EDIT

It seems to go beyond just NAs. If I filter any factor variable I get a slow response too:

system.time(
  dfnew <- df %>% 
    group_by(group) %>% 
    filter(any(narow == "a"))
)
# user  system elapsed 
# 5.32    7.45   12.83 
user63230

1 Answer


As @NelsonGon mentioned, anyNA is the function to use in your case.

The problem has already been mentioned here: https://stackoverflow.com/a/35713234/10580543

For the binary example, any is satisfied at the first occurrence of binary == 1, while is.na goes through the entire vector, here of length 2500000.

filter(anyNA(narow)) should be much faster than filter(any(is.na(narow)))

Edit: in practice the gain in time is very limited (4% faster) for a factor.
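As a quick sanity check, the two forms can be compared on a tiny data set. This is only a sketch: `toy` and its contents are made up for illustration, and it shows that both filters keep the same groups, not how fast they are.

```r
library(dplyr)

# Hypothetical toy data: 4 groups of 3 rows, with a single NA in group "g2"
toy <- data.frame(
  group = rep(c("g1", "g2", "g3", "g4"), each = 3),
  narow = factor(c("a", "b", "c", "a", NA, "c", "a", "b", "c", "a", "b", "c"))
)

# Keep whole groups that contain at least one NA, written both ways
with_any   <- toy %>% group_by(group) %>% filter(any(is.na(narow)))
with_anyNA <- toy %>% group_by(group) %>% filter(anyNA(narow))
```

Both results should contain only the rows of the group holding the NA, so switching to anyNA changes the cost of the check and not the result.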

However, converting the factor to character makes the filtering very fast (about 100 times faster). The explanation of the "why" is here, if you are interested: https://stackoverflow.com/a/34865113/10580543

If you are not interested in ordering the levels, using characters instead of factors for categorical variables is usually preferred, and far more efficient.
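A minimal sketch of that approach, assuming a small stand-in for the question's data (`df_small` and its sizes are hypothetical, chosen so it runs instantly): convert the factor to character, filter by group, then convert back if a factor is needed afterwards. Actual timings depend on the data size and the dplyr version, so none are shown.

```r
library(dplyr)

# Hypothetical small stand-in for the question's df: 100 groups of 50 rows
df_small <- data.frame(
  group = rep(paste0("g", 1:100), each = 50),
  narow = factor(rep(letters, each = 2, length.out = 5000))
)
df_small$narow[c(345, 777)] <- NA   # rows 345 and 777 fall in groups g7 and g16

res <- df_small %>%
  mutate(narow = as.character(narow)) %>%       # filter on a character vector
  group_by(group) %>%
  filter(anyNA(narow)) %>%
  ungroup() %>%
  mutate(narow = factor(narow, levels = letters))  # optional: back to factor
```

Passing `levels = letters` when converting back keeps the original level set even if some letters no longer occur after filtering.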

tom
  • but this is just as slow I think? (12.55s) I have edited my question - it seems that `any` is as slow for extracting a `factor` too? – user63230 Aug 13 '19 at 09:26
  • Thanks for the links. So, the answer to my question is that this speed is to be expected and there is not a quicker way to do it using `any`? Interesting! – user63230 Aug 14 '19 at 08:48
  • @user63230 The quickest way is to perform the filtering on a character vector; for this step of data management you do not need them to be factors anyway. You can still convert them back to factor later. – tom Aug 14 '19 at 09:13