filter variables whose column names contain pattern

Question

I am trying to filter NA, NaN and Inf values out of a tbl using dplyr's filter function.

The trick is that I only want to apply the filter to columns whose names contain a specific pattern. The pattern is: r1, r2, r3, etc.

I have tried to combine grep and filter to achieve this, but can't get it to work. My current code looks like this:

filter_(!is.na(grep("r[1-9]", colnames(DF), value = TRUE)) 
& !is.infinite(grep("r[1-9]", colnames(DF), value = TRUE)) 
& !is.nan(grep("r[1-9]", colnames(DF), value = TRUE)))

However, this code returns a warning message: "Truncating vector to length 1." And the data returned is unfiltered.

I suspect that the is.na functions are causing the problem, because I've seen an example online where grep is combined with filter using a normal condition (i.e. condition == value) rather than a condition based on is.na.

Can you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your dataset? — Z.Lin, Sep 04 '17 at 16:11
I believe that the problem is that you are testing `colnames(DF)` for `NA`, `NaN` and `Inf`, when it should be the values in those columns. (The ones that match your pattern `r[1-9]`.) — Rui Barradas, Sep 04 '17 at 16:15
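
The point made in the comment above can be checked directly. This is a minimal sketch, assuming a small made-up data frame `DF` (hypothetical, not the asker's data): grep with value = TRUE hands back column names, so is.na() ends up testing those names and never sees the column values.

```r
# Hypothetical stand-in for the asker's DF (an assumption for illustration)
DF <- data.frame(r1 = c(1, NA, Inf), r2 = c(NaN, 2, 3), x = 1:3)

# grep(..., value = TRUE) returns the matching column *names*, not the columns
nms <- grep("r[1-9]", colnames(DF), value = TRUE)
nms
# [1] "r1" "r2"

# is.na() on a character vector only asks whether a name itself is missing
is.na(nms)
# [1] FALSE FALSE

# The values that need testing live in DF[nms]:
# count the non-finite entries (NA, NaN, Inf) per matched column
bad <- sapply(DF[nms], function(col) sum(!is.finite(col)))
```

Here `bad` holds 2 for r1 (NA and Inf) and 1 for r2 (NaN), which is the information the filter condition actually needs.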

pachadotdev · Answer 1 · 2017-09-04T19:05:16.193

dplyr provides matches(), which is useful for this.

Example 1: How matches() works

library(dplyr)

# remove columns whose names contain "mp" (matches() takes a regular expression)
mtcars %>% select(-matches("mp"))

# keep only columns whose names contain "mp"
mtcars %>% select(matches("mp"))

Example 2: Using matches() in the context of your request but using a MWE

# Create a dummy dataset
data = tibble(id = c("John","Paul","George","Ringo"),
              r1 = c(1,2,NA,NA), 
              r2 = c(1,2,NA,4),
              s1 = c(1,NA,3,4))

# Drop rows with NA in any column whose name contains "r" followed by a digit
data %>% filter_at(vars(matches("r[0-9]")), all_vars(!is.na(.)))
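
In dplyr 1.0.0 and later, filter_at() is marked as superseded. The same filter can be sketched with if_all(), assuming dplyr 1.0.4 or newer is installed (that version requirement is an assumption about your setup, not something stated in this answer):

```r
library(dplyr)

# Same dummy data as in the example above
data <- tibble(id = c("John", "Paul", "George", "Ringo"),
               r1 = c(1, 2, NA, NA),
               r2 = c(1, 2, NA, 4),
               s1 = c(1, NA, 3, 4))

# Keep rows where every column matching "r[0-9]" is non-missing.
# Assumes dplyr >= 1.0.4, where if_all() is available.
res <- data %>% filter(if_all(matches("r[0-9]"), ~ !is.na(.x)))

res$id
# [1] "John" "Paul"
```

Paul is kept even though his s1 value is NA, because s1 does not match the pattern and is therefore not tested.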

I've just realised I did not complete the example. Now it's complete, @anita-mcgill — pachadotdev, Sep 04 '17 at 19:06

lmo · Answer 2 · 2017-09-04T16:43:05.597

Here is a base R method to filter rows, comparing specific columns.

# sample data
set.seed(1234)
dat <- data.frame(r1=c(NA, 1,NaN, 5, Inf), r2=c(NA, 1,NaN, NA, Inf), d=rnorm(5))

This data set looks like:

dat
   r1  r2          d
1  NA  NA -1.2070657
2   1   1  0.2774292
3 NaN NaN  1.0844412
4   5  NA -2.3456977
5 Inf Inf  0.4291247

We will check the first two columns and ignore the third column. Notice that the only row that should remain is row 2.

dat[Reduce("&", lapply(dat[grep("^r", names(dat))], is.finite)),]
  r1 r2         d
2  1  1 0.2774292

Here, the data.frame is subset using grep to select the appropriate columns (1 and 2), and the result is fed to lapply. The regex "^r" says to include only variables whose names start with "r". In the lapply loop, each vector is checked using is.finite, which returns FALSE for NA, NaN, and Inf. The resulting list of logical vectors is fed to Reduce, which returns a logical vector whose length is the number of rows of the data.frame; an element is TRUE if and only if every selected value in that row is finite.
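
The one-liner can also be unrolled into its intermediate steps. This is only a sketch of the same logic using the sample data above; the names `cols`, `checks` and `keep` are introduced here for illustration:

```r
# Sample data from the answer
set.seed(1234)
dat <- data.frame(r1 = c(NA, 1, NaN, 5, Inf), r2 = c(NA, 1, NaN, NA, Inf), d = rnorm(5))

# Step 1: positions of the columns whose names start with "r"
cols <- grep("^r", names(dat))          # 1 2

# Step 2: one logical vector per selected column
checks <- lapply(dat[cols], is.finite)
# checks$r1: FALSE TRUE FALSE TRUE  FALSE
# checks$r2: FALSE TRUE FALSE FALSE FALSE

# Step 3: element-wise AND across the list, giving one flag per row
keep <- Reduce("&", checks)             # FALSE TRUE FALSE FALSE FALSE

# Step 4: keep only the rows flagged TRUE
dat[keep, ]
```

Row 4 is dropped because r2 is NA there, even though r1 is finite.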

eipi10 · Answer 3 · 2017-09-04T17:19:36.527

With dplyr, you can use the filter_at function:

dat %>% filter_at(vars(matches("^r[1-9]")), all_vars(is.finite(.)))

Using @lmo's sample data, the result is:

  r1 r2         d
1  1  1 0.2774292
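
If only Inf and NaN should be dropped while NA is kept, one option is to put both tests inside a single all_vars() call. This is a sketch on @lmo's sample data, not a tested answer to any particular real dataset:

```r
library(dplyr)

# @lmo's sample data
set.seed(1234)
dat <- data.frame(r1 = c(NA, 1, NaN, 5, Inf), r2 = c(NA, 1, NaN, NA, Inf), d = rnorm(5))

# Drop rows with Inf/-Inf or NaN in the r columns, but keep plain NA.
# is.infinite(NA) and is.nan(NA) are both FALSE, so NA passes both tests.
res <- dat %>%
  filter_at(vars(matches("^r[1-9]")),
            all_vars(!is.infinite(.) & !is.nan(.)))

res$r1
# [1] NA  1  5
```

Rows 3 (NaN) and 5 (Inf) are removed; rows 1, 2 and 4 remain.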

edited Sep 04 '17 at 17:19

answered Sep 04 '17 at 16:52

Hi, this works. But I've now discovered a further issue, which can be resolved by filtering only the `Inf` and `NaN` values (and leaving in `NA`). To do this, can I replace the second argument with `all_vars(!is.infinite(.)) & all_vars(!is.nan(.))`? – Anita McGill Sep 05 '17 at 06:21
@AnitaMcGill above I posted a solution that points to your current problem – pachadotdev Sep 11 '17 at 16:13
