Filtering multiple string columns based on 2 different criteria - questions about "grepl" and "starts_with"

Question

I want to create a subset of my data by using the select and filter functions from dplyr. I have consulted a few similar questions about partial string matches and selecting with grepl, but found no solution to my problem.

The columns that I want to filter all start with the same letters, let's say "DGN." So I have DGN1, DGN2, DGN3, etc. all the way up until DGN25. The two criteria I want to filter on are contains "C18" and starts with "153".

Ideally, I would want to run a code chunk that looks like this:

dgn_subset <- df %>%
    select(ID, date, starts_with("DGN") %>%
    filter(grepl("C18"|starts_with("153"), starts_with("DGN")))

There are 2 main issues here -- I don't think that grepl can take "starts_with" as an input for the pattern. Also, it can't take "starts_with" as the column argument (I think it may only be able to filter on one column at a time?).

To get the code to work, I could replace the starts_with("153") portion with "153" and the starts_with("DGN") portion with "DGN1," but that gives me many observations that I do not want and it only filters on the first DGN column.

Are there any alternative functions or packages I can use to solve my problem? Any help is greatly appreciated!

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Note that `grepl` takes a regular expression but won't work with `starts_with` which is a special dplyr thing. And you can use `across()` with `filter()` to check multiple columns at once. — MrFlick, Jul 23 '20 at 21:28
@MrFlick Thank you very much. Quick question - will `across()` filter for the pattern in every single column or any one column? I forgot to mention that I am just looking for the pattern in any one of the 25 columns. — emomura, Jul 23 '20 at 22:07

akrun · Accepted Answer · 2020-07-23T22:20:14.147

0

We can use filter with across. where we loop over the columns using c_across specifying the column name match in select_helpers (starts_with), get a logical output with grepl checking for either "C18" or (|) the number that starts with (^) 153

library(dplyr) #1.0.0
library(stringr)
df %>%
    # // do a row wise grouping
    rowwise() %>%
    # // subset the columns that starts with 'DGN' within c_across
    # // apply grepl condition on the subset
    # // wrap with any for any column in a row meeting the condition
    filter(any(grepl("C18|^153", c_across(starts_with("DGN")))))

Or with filter_at

df %>% 
  # //apply the any_vars along with grepl in filter_at
  filter_at(vars(starts_with("DGN")), any_vars(grepl("C18|^153", .)))

data

df <-  data.frame(ID = 1:3, DGN1 = c("2_C18", 32, "1532"), 
          DGN2 = c("24", "C18_2", "23"))

edited Jul 23 '20 at 22:20

answered Jul 23 '20 at 21:29

akrun

874,273
37
540
662

Thank you! Unfortunately, I am returned with 0 observations (there should be a couple thousand), so I may just specify all of the combinations that start with "153." What exactly does the "~" do here? – emomura Jul 23 '20 at 22:02
@emomura Can you update your post with a small example using dput so that I can test it – akrun Jul 23 '20 at 22:03
@emomura can you check my update. I tested on a small reprodcuible example – akrun Jul 23 '20 at 22:09
1

Works perfectly! Thank you very much. Forgot to mention that I was looking for those criteria in any of the columns, not all - my bad. – emomura Jul 23 '20 at 22:18

score 0 · Answer 2 · answered Jul 24 '20 at 01:24

In base R, you can use startsWith to select the columns that you want to look for, using sapply check for the pattern in those columns. Use rowSums to calculate how many times that pattern occurs in each row and then select the row with at-least one occurrence.

cols <- startsWith(names(df), 'DGN')
df[rowSums(sapply(df[cols], grepl, pattern = 'C18|^153')) > 0, ]

Similar logic but with lapply you can do :

df[Reduce(`|`, lapply(df[cols], grepl, pattern = 'C18|^153')), ]

Filtering multiple string columns based on 2 different criteria - questions about "grepl" and "starts_with"

2 Answers2

data