My dataset codes "Not available" differently depending on the variable (-99, -100, NA). It has hundreds of variables, so the first step was to look up which columns are affected, in order to recode them appropriately.

EDIT: thanks to @joran and @G. Grothendieck, I got answers pretty quickly. TL;DR: the colSums option is probably the best: fast, succinct and flexible (I wasn't sure how to put its arguments into a variable, but wrapping it in a function works; see the sketch after the benchmark).

  library(microbenchmark)

  f1 <- function() {colnames(tbl_df[map_lgl(tbl_df, ~any(. == -100, na.rm = TRUE))])}

  f2 <- function() {names(tbl_df)[colSums(tbl_df == -100, na.rm = TRUE) > 0]}

  f3 <- function() {colnames(tbl_df[, sapply(tbl_df, function(x) any(x == -100, na.rm = TRUE))])}

  microbenchmark(f1(), f2(), f3(), unit = "relative")
#> Unit: relative
#>  expr      min       lq     mean   median       uq       max neval
#>  f1() 2.924239 2.694531 2.026845 2.578680 2.604190 0.8291649   100
#>  f2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
#>  f3() 1.113641 1.140000 1.053742 1.167211 1.178409 0.8241631   100
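To follow up on my own parenthetical above: the searched-for value is easy to parameterise by wrapping the colSums idiom in a function. A minimal sketch (the helper name `cols_with` is my own):

  # wrap the colSums idiom so the missing-value code is an argument
  cols_with <- function(data, value) names(data)[colSums(data == value, na.rm = TRUE) > 0]

  cols_with(tbl_df, -100)
#> [1] "c"
  cols_with(tbl_df, -99)
#> [1] "a" "b"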

Original post continues here

I've tried to generalise the sapply answer here, and after some trial and error have succeeded with purrr::map... but I don't understand why some of the things I tried don't work; in particular, sapply seems erratic.

Here's a reprex:

library(tidyverse)

set.seed(124)

df <- data.frame(a = c(sample(1:49, 49), -99, NA),
                 b = c(sample(1:50, 50), -99),
                 c = c(sample(1:50, 50), -100),
                 d = sample(1:51, 51),
                 e = sample(1:51, 51))

# First puzzle: the answer in the other thread doesn't work with a data.frame

colnames(df[,sapply(df, function(x) any(is.na(x)))])
#> NULL

# but works with a tibble

tbl_df <- as_tibble(df)

colnames(tbl_df[,sapply(tbl_df, function(x) any(is.na(x)))])
#> [1] "a"

# However, this doesn't work for other missing-value codings
# (Edit: it seems to work when more than one column matches??)

colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -99))])
#> [1] "a" "b"

colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100))])
#> Error in tbl_df[, sapply(tbl_df, function(x) any(x == -100))]: 
#> object of type 'closure' is not subsettable

#(NB: I get "Error: NA column indexes not supported" on my console)
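Inspecting the index vector itself makes the failure visible: with the default na.rm = FALSE, any() returns NA for column a (which contains an NA and no -100), and an NA column index is what triggers the error:

# any() propagates NA when no TRUE is found but an NA is present
sapply(tbl_df, function(x) any(x == -100))
#>     a     b     c     d     e 
#>    NA FALSE  TRUE FALSE FALSE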

I imagine this has something to do with the way sapply works, but the documentation and answers like this one don't quite clear it up for me...

I've come up with the following, which works fine for checking values both individually and in groups. I'd welcome any improvements (e.g. keeping the values alongside the columns where they're found; a rough sketch of that follows below).

colnames(tbl_df[unlist(map(tbl_df, ~any(. %in% c(-99, -100, NA))))])
#> [1] "a" "b" "c"

On a side note, I don't really understand why trying to achieve a similar thing in a pipe returned the wrong result:

tbl_df %>% 
  filter_all(all_vars(. == -99)) %>% 
  colnames()
#> [1] "a" "b" "c" "d" "e"

Sorry if this seems like a motley collection of questions, but I'd appreciate any clarification!

– Fons MA

Comments:

  In your first example you've been bitten by the `drop = FALSE` issue with subsetting data frame columns. If there's only one column, the result is "dropped" down to a single vector, hence no column names. – joran Dec 18 '18 at 22:26

  ...in other cases it seems you're forgetting that you might want to use `na.rm = TRUE` in `any()`. And lastly, `filter` functions subset the data frame's rows, not columns, so I would expect any use of `filter` to result in all the columns you started with. – joran Dec 18 '18 at 22:30

1 Answer


1) drop=FALSE Subscripting a data.frame drops the dimensions for one-column results unless drop = FALSE is used, so try the following. (Tibble subscripting does not drop dimensions.)

# colnames(df[,sapply(df, function(x) any(is.na(x)))])

colnames(df[, sapply(df, function(x) any(is.na(x))), drop = FALSE])
## [1] "a"

or easier:

names(df)[apply(is.na(df), 2, any)]
## [1] "a"

or

names(df)[colSums(is.na(df)) > 0]
## [1] "a"

2) na.rm=TRUE In the next example there is an NA in the first column. If we exclude that we get an answer:

# colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100))])

colnames(tbl_df[, sapply(tbl_df, function(x) any(x == -100, na.rm = TRUE))])
## [1] "c"

or

names(tbl_df)[colSums(tbl_df == -100, na.rm = TRUE) > 0]
## [1] "c"

or use which

names(tbl_df[, sapply(tbl_df, function(x) length(which(x == -100)) > 0)])
## [1] "c"

or

names(tbl_df)[lengths(lapply(as.data.frame(tbl_df == -100), which)) > 0] 
## [1] "c"

or using which(..., arr.ind = TRUE)

names(tbl_df)[ unique(which(tbl_df == -100, arr.ind = TRUE)[, "col"]) ]
## [1] "c"

3) simplification We can simplify this by factoring out the generic part that does not depend on the data into is.bad:

# colnames(tbl_df[unlist(map(tbl_df, ~any(. %in% c(-99, -100, NA))))])

is.bad <- function(x) any(x %in% c(-99, -100, NA))

names(tbl_df)[ sapply(tbl_df, is.bad) ]
## [1] "a" "b" "c"

or

Filter(function(x) is.bad(tbl_df[[x]]), names(tbl_df))
## [1] "a" "b" "c"

or for a different approach:

names(tbl_df)[colSums(is.na(tbl_df) | tbl_df == -99 | tbl_df == -100) > 0]
## [1] "a" "b" "c"

4) select_if filter_all with all_vars goes row by row and picks out those rows for which all the columns satisfy the condition. You want to go column by column, not row by row. Use select_if instead:

tbl_df %>%
  select_if(~ any(. == -99)) %>%
  names
## [1] "a" "b"
– G. Grothendieck