0

This is my sample data:

id <- c("1a","2c","3d","4f","5g","6e","7f","8q","9r","10v","11x","12l")
dg1 <- c(111234,902754,111235,111236,113456,123754,288563,396186,987654,987865,288563,396186)
dg2 <-c("",111235,111236,113456,"","","","","","",902754,902754)
df<-cbind(id,dg1,dg2)

I'd like to create a new column that indicates whether a string (111 or 113) is present across multiple columns such that my final df looks like this:

dt1 <- c(1,1,1,1,1,0,0,0,0,0,0,0)
df <- cbind(df,dt1)

I've tried to do this:

df %>%
  filter(any_vars(grepl('^(113|111)')))

and

df %>%
  select(contains("113","111"))

with the intention of tagging the filtered rows with mutate then doing a left join into the original data frame.

I've tried to Frankenstein together some code from similar problems to no avail here and here. I'm trying to avoid this solution because my actual data has over a hundred columns to sort through with dozens of strings to search through (trying to avoid typing out and potentially missing a combination).

Can anyone help?

thou
  • 35
  • 4

2 Answers2

0

One way to do is.

Data:

id <- c("1a","2c","3d","4f","5g","6e","7f","8q","9r","10v","11x","12l")
dg1 <- c(111234,902754,111235,111236,113456,123754,288563,396186,987654,987865,288563,396186)
dg2 <-c("",111235,111236,113456,"","","","","","",902754,902754)
df<-cbind.data.frame(id,dg1,dg2)

Code:

    df = df %>% mutate(d1= str_match(df$dg1, '^(113|111)') %>% .[,1] %>% {ifelse(.> 0, 1,0 )}) %>% 
  mutate(d2 = str_match(df$dg2, '^(113|111)') %>% .[,1] %>% {ifelse(.> 0, 1,0 )})%>% 
  mutate(dt1 = ifelse(d1 > 0 | d2 > 0, 1,0)) %>% 
  select(id, dg1, dg2, dt1)

    id    dg1    dg2 dt1
1   1a 111234          1
2   2c 902754 111235   1
3   3d 111235 111236   1
4   4f 111236 113456   1
5   5g 113456          1
6   6e 123754         NA
7   7f 288563         NA
8   8q 396186         NA
9   9r 987654         NA
10 10v 987865         NA
11 11x 288563 902754  NA
12 12l 396186 902754  NA
Nad Pat
  • 3,129
  • 3
  • 10
  • 20
  • Hello! Thank you for the quick reply. Is there a way to go through dg1-99 without typing out each line of code? – thou Mar 21 '22 at 16:31
0

Another option would be:

Check whether either of dg1 or dg2 begin with 111 or 113 using grepl()


df <- df %>% mutate(dt1= ifelse(grepl("^111|^113", dg1) == TRUE |
                                grepl("^111|^113", dg2) == TRUE, 1, 0))

Output

> df
    id    dg1    dg2 dt1
1   1a 111234          1
2   2c 902754 111235   1
3   3d 111235 111236   1
4   4f 111236 113456   1
5   5g 113456          1
6   6e 123754          0
7   7f 288563          0
8   8q 396186          0
9   9r 987654          0
10 10v 987865          0
11 11x 288563 902754   0
12 12l 396186 902754   0

Test if equal to your desired output

dt1 <- c(1,1,1,1,1,0,0,0,0,0,0,0)
df1 <- cbind(df,dt1)

identical(df, df1)

> identical(df, df1)
[1] TRUE

As requested: to select multiple column names:

df %>% rowwise() %>% mutate(dt1 = ifelse(any(grepl("^111|^113", across(dg1:dg99))),1,0))
Gnueghoidune
  • 1,237
  • 4
  • 13
  • Thank you so much for the quick response! Is there any way to do this if there's dg1-dg99 without typing out 100 lines of code? – thou Mar 21 '22 at 16:21
  • It would be preferable if instead of using column names I could use column positions. For example: df %>% mutate(dt1= ifelse(grepl("^111|^113", [,c(2,3)] == TRUE, 1, 0)) – thou Mar 21 '22 at 16:50
  • @thou See the update – Gnueghoidune Mar 21 '22 at 16:57
  • Thank you for the update! Any ideas why the multiple column names take a lot longer to process than your original answer? It's much faster to process: (grepl("^111|^113", dg1)==TRUE | .... copied and pasted a hundred times with changing "dg1" to desired variable rather than the across(dg1:dg99). – thou Mar 21 '22 at 21:02
  • I cannot say for sure, but I know that some dplyr verbs (including `across`) had performance problems, at least in the past. – Gnueghoidune Mar 21 '22 at 22:43
  • 1
    Just as a heads up, figured out that writing a function with: "ifelse(any(grepl("^111|^113", x)),1,0)}" that's used in "mutate(dt = apply(.[,2:3], MARGIN = 1, FUNCTION))" was much faster than using across in this instance. Copying and pasting dg1:dg99 a hundred times was the fastest. – thou Mar 24 '22 at 05:25
  • Thanks for the remark! Will keep that in mind if I ever run into the same problem :) – Gnueghoidune Mar 24 '22 at 05:40