My dataset codes "Not available" differently depending on the variable (-99, -100, NA). It has hundreds of variables, so the first step was to find which columns are affected in order to recode them appropriately.
EDIT: thanks to @joran and @G. Grothendieck, I got answers pretty quickly. Just to provide a TL;DR: the option with colSums is probably best: fast, succinct and flexible (although its arguments are not so easy to put into a variable?)
# tbl_df is the tibble built in the reprex below; map_lgl() comes from purrr
library(microbenchmark)
f1 <- function() {colnames(tbl_df[map_lgl(tbl_df, ~any(. == -100, na.rm = TRUE))])}
f2 <- function() {names(tbl_df)[colSums(tbl_df == -100) > 0]}
f3 <- function() {colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100, na.rm = TRUE))])}
microbenchmark(f1(), f2(), f3(), unit = "relative")
#> Unit: relative
#>  expr      min       lq     mean   median       uq       max neval
#>  f1() 2.924239 2.694531 2.026845 2.578680 2.604190 0.8291649   100
#>  f2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
#>  f3() 1.113641 1.140000 1.053742 1.167211 1.178409 0.8241631   100
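One caveat with the `colSums` option as written: a column that contains NA makes `colSums()` return NA for that column, so the logical index contains NA and the result includes an NA entry. Below is a minimal sketch of a variant that passes `na.rm = TRUE` and takes the code from a variable. The helper name `cols_with_code` and the toy data are illustrative assumptions, not part of the answers mentioned above.

```r
# Hypothetical helper: names of the columns that contain a given code.
cols_with_code <- function(data, code) {
  # data == code yields a logical matrix; NA cells are ignored in the sum
  hits <- colSums(data == code, na.rm = TRUE) > 0
  names(data)[hits]
}

# Toy data with the same kinds of codes as the reprex (values are made up)
toy <- data.frame(a = c(1, -99, NA),
                  b = c(2, 3, -99),
                  c = c(4, 5, -100),
                  d = c(6, 7, 8))

cols_with_code(toy, -100)
#> [1] "c"
cols_with_code(toy, -99)
#> [1] "a" "b"
```

Wrapping the call in a function is one way to keep the code to look for in a variable.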
Original post continues here
I've tried to generalise the sapply answer here, and after some trial and error have succeeded with purrr::map... But I don't understand why some of the things I tried do not work; in particular, sapply seems erratic.
Here's a reprex:
library(tidyverse)
set.seed(124)
df <- data.frame(a = c(sample(1:49, 49), -99, NA),
                 b = c(sample(1:50, 50), -99),
                 c = c(sample(1:50, 50), -100),
                 d = sample(1:51, 51),
                 e = sample(1:51, 51))
# First puzzle: answer in other thread doesn't work with data.frame
colnames(df[,sapply(df, function(x) any(is.na(x)))])
#> NULL
# but works with a tibble
tbl_df <- as_tibble(df)
colnames(tbl_df[,sapply(tbl_df, function(x) any(is.na(x)))])
#> [1] "a"
# However, this doesn't work for any other missing value coding
# (Edit: it seems to work if there's more than one column??)
colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -99))])
#> [1] "a" "b"
colnames(tbl_df[,sapply(tbl_df, function(x) any(x == -100))])
#> Error in tbl_df[, sapply(tbl_df, function(x) any(x == -100))]:
#> object of type 'closure' is not subsettable
#(NB: I get "Error: NA column indexes not supported" on my console)
I can imagine this has something to do with the way sapply works, but the documentation and answers like this one don't quite cut it for me...
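A minimal sketch on made-up data suggests that sapply itself is not the culprit. Two base R behaviours seem to be enough to reproduce both puzzles: `any()` returns NA when no element is TRUE but at least one is NA, and selecting a single column from a data.frame with `[` drops the result to a plain vector unless `drop = FALSE` is given. The toy data frame below is an assumption for illustration only.

```r
# Toy data: column a holds both -99 and NA, as in the reprex
toy <- data.frame(a = c(1, -99, NA),
                  b = c(2, 3, -99),
                  c = c(4, 5, -100))

# any() is NA when nothing is TRUE but something is NA ...
sapply(toy, function(x) any(x == -100))
# a: NA, b: FALSE, c: TRUE  (the NA is what breaks the column index)

# ... while a single TRUE wins over NA, so the -99 check appears to work
sapply(toy, function(x) any(x == -99))
# a: TRUE, b: TRUE, c: FALSE

# na.rm = TRUE gives a clean logical vector
has_100 <- sapply(toy, function(x) any(x == -100, na.rm = TRUE))
names(toy)[has_100]                    # "c"

# A one-column selection from a data.frame drops to a vector,
# so colnames() returns NULL; drop = FALSE keeps the data.frame
has_na <- sapply(toy, function(x) any(is.na(x)))
colnames(toy[, has_na])                # NULL
colnames(toy[, has_na, drop = FALSE])  # "a"
```

Tibbles never drop to a vector with `[`, which would explain why the `is.na()` version works after converting.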
I've come up with the following, which works fine for checking values both individually and in groups. I'd welcome any improvements (e.g. keeping the values alongside the columns where they're found).
colnames(tbl_df[unlist(map(tbl_df, ~any(. %in% c(-99, -100, NA))))])
#> [1] "a" "b" "c"
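As one possible take on the "values alongside the columns" idea, here is a rough base R sketch that returns a named list of the codes found in each column. The names `codes` and `found` and the toy data are assumptions for illustration; it relies on `%in%` matching NA to NA.

```r
codes <- c(-99, -100, NA)

# Toy data (made-up values); column d contains none of the codes
toy <- data.frame(a = c(1, -99, NA),
                  b = c(2, 3, -99),
                  c = c(4, 5, -100),
                  d = c(6, 7, 8))

# For each column, keep the codes that occur in it
found <- lapply(toy, function(x) codes[codes %in% x])

# Drop the columns in which none of the codes occur
found <- Filter(length, found)

names(found)
#> [1] "a" "b" "c"
found$a
#> [1] -99  NA
```

The names of the list give the affected columns, and each element gives the codes to recode in that column.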
On a side note, I don't really understand why trying to achieve a similar thing in a pipe yielded the wrong result:
tbl_df %>%
  filter_all(all_vars(. == -99)) %>%
  colnames()
#> [1] "a" "b" "c" "d" "e"
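A likely explanation is that `filter_all()` only ever removes rows, so every column survives and `colnames()` lists them all. A small sketch on made-up data, using `select_if()` as one column-wise alternative (the toy data and the variable names `after_filter` and `cols_kept` are assumptions for illustration):

```r
library(dplyr)

toy <- data.frame(a = c(1, -99, NA),
                  b = c(2, 3, -99),
                  c = c(4, 5, -100))

# filter_all() drops rows, never columns: no row has -99 in every column,
# so the result has zero rows but still all three columns
after_filter <- toy %>%
  filter_all(all_vars(. == -99))
nrow(after_filter)
#> [1] 0
colnames(after_filter)
#> [1] "a" "b" "c"

# select_if() keeps the columns for which the predicate is TRUE
cols_kept <- toy %>%
  select_if(~ any(. == -99, na.rm = TRUE)) %>%
  colnames()
cols_kept
#> [1] "a" "b"
```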
Sorry if this seems like a motley collection of questions, but I'd appreciate any clarification!