I have a dataframe with multiple columns of strings. I want to
- look at all strings in a series of these columns
- see if any strings match an input pattern, ideally
regex
- create a single new binary column that is equal to 1 or TRUE if there are any matches across all columns for a given row, and 0 or FALSE if there are no matches whatsoever.
My question has two parts.
First, I am using sapply
and grepl
and have managed to create a dataframe of TRUE/FALSE values for all of the columns in question, but am having trouble figuring out how to get to the "create a new single column" step. I tried rowwise
mutate but that's throwing errors that I honestly don't understand!
df <- data.frame(
idx = 1:5,
column_b = letters[1:5],
column_c = c('abc', 'abc', 'def', 'def', 'ghi'),
column_d = c('def', 'def', 'def', 'def', 'def'),
column_e = c('ghi', 'ghi', 'ghi', 'abc', 'ghi')
)
apply_factor <- function(df, factor, col_low, col_high, pattern) {
df %>%
rowwise() %>%
mutate(factor = sum(c_across(as.data.frame(sapply(select(df, {{col_low}}:{{col_high}}), grepl, pattern={{pattern}})))), na.rm = TRUE)
}
apply_factor(df, factor = 'abc', 'column_c', 'column_e', pattern = "^abc")
(double curly braces {{}}
in the function due to something or other about dplyr
I saw in another question...)
Console responds:
Error in `mutate()`:
! Problem while computing `factor = sum(...)`.
i The error occurred in row 1.
Caused by error in `as_indices_impl()`:
! Must subset columns with a valid subscript vector.
x Subscript has the wrong type `data.frame<
column_c: logical
column_d: logical
column_e: logical
>`.
i It must be numeric or character.
If I run just the as.data.frame() part it gives me something that looks pretty good. I just need to make it into a new column!
> as.data.frame(sapply(select(df, 'column_c':'column_e'), grepl, pattern = "^abc"))
column_c column_d column_e
1 TRUE FALSE FALSE
2 TRUE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE TRUE
5 FALSE FALSE FALSE
Second, I need to extend this function to a much larger dataset with many millions of rows and between 30 and 100 string columns. The large and variable number of columns, by the way, is why I need this function to be able to take any range of columns instead of listing out every possible column. Since processing time will be somewhat of a factor: is what I am doing here going to be massively inefficient? Is there an easier, faster way to accomplish this?
I've tried variations on sum, summarise
, and any, but honestly it's very likely I did it wrong.
I also looked at this post, but I can't limit to using %in%
as I need regex (I'm pretty sure I can't use regex with %in%...)
(I am relearning R after a long stint away in Stata-land, so please forgive any really obvious blunders or terminology weirdness!)