I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.

The code below returns the whole matched string in foo, without any of the sub-expressions:

entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101      POULET Laure                               FRA     1992   25-29     E. M. S. Bron Natation          26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)

In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.

rows$isMatch <- grepl(entryPattern, rows$text)

What I'm hoping to do is add the sub-expressions as new columns in the rows data frame (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.

carpiediem

1 Answer

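It seems that regmatches combined with regexpr only returns the whole match, not the sub-expressions, so it won't do what I want. Instead, I used str_match from the stringr package, as suggested by @kent-johnson. Note that I also made the last part of the name group and the club group non-greedy (+?), so they stop before the whitespace that follows them.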

library(stringr)
test <- "101      POULET Laure                               FRA     1992   25-29     E. M. S. Bron Natation          26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]

Which outputs:

[1] "101"                            
[2] "POULET Laure"                   
[3] "FRA"                            
[4] "1992"                           
[5] "25-29"                          
[6] "E. M. S. Bron Natation"
[7] "26.00"   
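
To get from there to the new columns asked for in the question, the same str_match call can be run on the whole text column at once: it returns one row per input string, with NA in every column for strings that do not match. The following is a minimal sketch, not a tested solution for the real data. The two-row rows data frame is a made-up stand-in, and the column names other than rank, name and country (birthYear, ageGroup, club, time) are assumptions chosen for illustration.

```r
library(stringr)

# Same pattern as in the answer above (non-greedy name and club groups)
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"

# Hypothetical stand-in for the real `rows` data frame:
# one well-formed entry and one line that should not match
rows <- data.frame(
  text = c(
    "101      POULET Laure                               FRA     1992   25-29     E. M. S. Bron Natation          26.00",
    "not a result line"
  ),
  stringsAsFactors = FALSE
)

# Character matrix: column 1 is the full match, columns 2 to 8 are the groups
parts <- str_match(rows$text, entryPattern)

# Column names are illustrative; only rank, name and country come from the question
fields <- c("rank", "name", "country", "birthYear", "ageGroup", "club", "time")
for (i in seq_along(fields)) {
  rows[[fields[i]]] <- parts[, i + 1]
}

# Rows that did not match have NA in the full-match column
rows$isMatch <- !is.na(parts[, 1])
```

All extracted values are character strings, so numeric fields such as rank would still need something like as.integer(rows$rank) afterwards.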
carpiediem