Group string values in column by their beginning

Question

I have a dataframe:

ID    value
1     request body: <?xml version="2.0"> values received
2     request body: <code> 'jnwg3425'
3     request body: <?xml version="2.0", <PlatCode>, <code> 'qwefn2'
4     Error in message received
5     Error in message received
6     Push forward message x3535
7     Push forward message <MarkCheckMSG>

I want to group the values in the second column by similarity at the beginning of the string. How could I get a dataframe with the pattern of each group, like this:

    patterns
request body:
Error in message received
Push forward message

How could I do that? Which methods best suit my goal? Should I use regular expressions or maybe string distance methods?

Abdessabour Mtk · Accepted Answer · 2020-11-01T23:12:42.267

First we extract either the first three words, or the first two words when they are followed by :, using stringr::str_extract. Alternatively, you could use sub to match the full value and keep only the captured expression, i.e. sub('^(expre).+$', '\\1', value). The regex pattern is \w+ \w+(:| \w+), i.e. match two words (\w+ \w+), then match either a : or a space followed by another word.

library(dplyr)    # provides %>%, mutate() and group_by()
library(stringr)  # provides str_extract()
df %>% 
    mutate(beginnings = str_extract(value, "\\w+ \\w+(:| \\w+)")) %>%
    group_by(beginnings)

# A tibble: 7 x 3
# Groups:   beginnings [3]
     ID value                                                     beginnings    
  <int> <fct>                                                      <chr>               
1     1 request body: <?xml version=2.0> values received           request body:       
2     2 request body: <code> jnwg3425                              request body:       
3     3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:       
4     4 Error in message received                                  Error in message    
5     5 Error in message received                                  Error in message    
6     6 Push forward message x3535                                 Push forward message
7     7 Push forward message <MarkCheckMSG>                        Push forward message
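
The sub alternative mentioned above can be sketched in base R as follows. This is only a minimal sketch: the value vector is hypothetical sample data, and the pattern ends in .* rather than .+ (an assumption on my part) so that a string consisting of nothing but the beginning still matches.

```r
# Hypothetical sample values mirroring the question's column.
value <- c(
  "request body: <code> jnwg3425",
  "Error in message received",
  "Push forward message x3535",
  "<MarkCheckMSG>"  # has no leading "word word" beginning
)

# Anchor the whole string, capture the beginning in group 1,
# and replace the full match with that group.
beginnings <- sub("^(\\w+ \\w+(:| \\w+)).*$", "\\1", value)
print(beginnings)
```

One difference from str_extract is worth keeping in mind: when the pattern does not match, sub returns the input unchanged instead of NA, so the last value above stays as it is.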

Using a different regular expression

(\w+ )+[a-z]{2,}:? => match as many words followed by a space as possible ((\w+ )+), followed by two or more lowercase letters ([a-z]{2,}), and a : if it exists.

df %>%
   mutate(beginings= str_extract(value, "(\\w+ )+[a-z]{2,}:?")) %>%
   group_by(beginings)

# A tibble: 7 x 3
# Groups:   beginings [3]
     ID value                                                      beginings                
  <int> <fct>                                                      <chr>                    
1     1 request body: <?xml version=2.0> values received           request body:            
2     2 request body: <code> jnwg3425                              request body:            
3     3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:            
4     4 Error in message received                                  Error in message received
5     5 Error in message received                                  Error in message received
6     6 Push forward message x3535                                 Push forward message     
7     7 Push forward message <MarkCheckMSG>                        Push forward message
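
To get from the grouped result to the one-column patterns dataframe the question asks for, the extracted beginnings can be collapsed to their distinct values. The sketch below assumes dplyr and stringr are installed, rebuilds the sample data by hand (the df here is a stand-in, not the asker's actual object), and uses distinct() as one possible way to do the collapsing.

```r
library(dplyr)
library(stringr)

# Stand-in for the question's data, with value stored as character.
df <- data.frame(
  ID = 1:7,
  value = c(
    "request body: <?xml version=2.0> values received",
    "request body: <code> jnwg3425",
    "request body: <?xml version=2.0, <PlatCode>, <code> qwefn2",
    "Error in message received",
    "Error in message received",
    "Push forward message x3535",
    "Push forward message <MarkCheckMSG>"
  ),
  stringsAsFactors = FALSE
)

# Extract the beginning of each value, then keep one row per beginning.
patterns <- df %>%
  mutate(patterns = str_extract(value, "(\\w+ )+[a-z]{2,}:?")) %>%
  distinct(patterns)

print(patterns)
```

distinct() keeps the first occurrence of each beginning in row order and drops the other columns, so the result has a single patterns column.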

Thanks, and what if I have a row with a unique value, like . It will return NA in column beginings, not — , Nov 02 '20 at 07:56
You could either add another `mutate` coupled with an `if_else`, or add `|<\w+>` to the regex. Note that in `R` you need to escape the `\`. — Abdessabour Mtk, Nov 02 '20 at 11:33
And how can I write that regular expression not only for word characters but for any symbols, like * for example? — , Nov 06 '20 at 21:21
@reredf Instead of using the `\w` character class, use your own character class such as `[\w*&^%]`, adding any other symbol you want to match. Note that in `R` you need to escape the `\`. — Abdessabour Mtk, Nov 06 '20 at 21:48
Escape the backslash `\`, not the opening bracket, i.e. `[\\w*&^%]`. `R` string literals treat `\t` as a tab and `\n` as a newline, so a backslash followed by a character means something special to `R`. That is why the regex `\w`, which matches a word character, has to be written as `\\w` in `R`. — Abdessabour Mtk, Nov 06 '20 at 22:50
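
The suggestions from the comments can be sketched together as follows. This is a hedged illustration rather than the commenter's exact code: the four sample rows and the column names with_fallback, with_tag and with_star are made up for the example, and the custom class is reduced to `[\\w*]` to keep it simple.

```r
library(dplyr)
library(stringr)

# Hypothetical rows: the last two do not start with plain words.
df <- data.frame(
  ID = 1:4,
  value = c(
    "Error in message received",
    "Push forward message x3535",
    "<MarkCheckMSG>",
    "*** retry scheduled"
  ),
  stringsAsFactors = FALSE
)

# "\\w" in R source is the two-character regex \w.
base_pattern <- "(\\w+ )+[a-z]{2,}:?"

result <- df %>%
  mutate(
    beginings = str_extract(value, base_pattern),
    # Option 1: if_else falls back to the whole value when nothing matched.
    with_fallback = if_else(is.na(beginings), value, beginings),
    # Option 2: add an alternative that matches a lone <tag>.
    with_tag = str_extract(value, paste0(base_pattern, "|<\\w+>")),
    # Custom class: a "word" may now also contain the * symbol.
    with_star = str_extract(value, "([\\w*]+ )+[a-z]{2,}:?")
  )

print(result)
```

Note that str_extract returns the first match anywhere in the string, so for the last row the unmodified pattern skips the leading symbols and picks up the words after them, while the custom class keeps the symbols as part of the beginning.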
