Regex group capture in R with multiple capture-groups

Question

In R, is it possible to extract the capture groups from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the captured groups.

I need to extract key-value pairs from strings that are encoded thus:

\((.*?) :: (0\.[0-9]+)\)

I can always just do multiple full-match greps, or do some outside (non-R) processing, but I was hoping to do it all within R. Is there a function, or a package that provides one, to do this?

score 131 · Accepted Answer · answered Apr 06 '12 at 03:13

str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):

> library(stringr)
> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
     [,1]                         [,2]       [,3]          
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)"     "moretext" "0.111222"

answered Apr 06 '12 at 03:13

Kent Johnson

and `str_match_all()` to match all groups in a regex – smci Mar 26 '14 at 15:49
How can I just print only the captured groups for [,1] ? – nosh Feb 25 '19 at 22:03
Not sure what you are looking for. The captured groups are columns 2 & 3. `[,1]` is the full match. `[,2:3]` is the captured groups. – Kent Johnson Feb 27 '19 at 01:21

score 67 · Answer 2 · answered Jun 04 '09 at 22:44

gsub does this, from your example:

gsub("\\((.*?) :: (0\\.[0-9]+)\\)","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"

You need to double the backslashes inside the quoted string so that a single backslash reaches the regex engine.

Hope this helps.
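
To get the captured pieces into separate data frame columns with this approach, one option is to call sub() once per group. This is a minimal sketch, assuming every element matches the pattern; the names `s`, `pattern` and `pairs` are illustrative and not from the answer.

```r
# Illustrative input; every element is assumed to match the pattern.
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# One sub() call per capture group: "\\1" keeps the key, "\\2" the number.
pairs <- data.frame(
  key   = sub(pattern, "\\1", s),
  value = as.numeric(sub(pattern, "\\2", s)),
  stringsAsFactors = FALSE
)
print(pairs)
```

Elements that do not match are returned unchanged by sub(), so real input may need a grepl() check first.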

answered Jun 04 '09 at 22:44

David Lawrence Miller

Actually I need to pull out the captured substrings to put in a data.frame. But, looking at your answer, I guess I could chain gsub and a couple of strsplit's to get what I want, maybe: strsplit(strsplit(gsub(regex, "\\1::\\2::::", str), "::::")[[1]], "::") – Daniel Dickison Jun 05 '09 at 16:03
Great. The R `gsub` manpage very badly needs an example showing you need '\\1' to escape a capture-group reference. – smci Mar 26 '14 at 15:51

score 45 · Answer 3 · edited Aug 24 '17 at 01:15

Try regmatches() and regexec():

regmatches("(sometext :: 0.1231313213)",regexec("\\((.*?) :: (0\\.[0-9]+)\\)","(sometext :: 0.1231313213)"))
[[1]]
[1] "(sometext :: 0.1231313213)" "sometext"                   "0.1231313213"
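
Both functions are vectorised, so the same call works on a whole character vector. The sketch below stacks the per-string results into a matrix and then a data frame. It assumes every element matches; a non-matching element comes back as character(0) and would need separate handling. Variable names are illustrative.

```r
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# regexec() returns match positions only, so regmatches() needs the
# strings again to cut out the substrings.
hits <- regmatches(s, regexec(pattern, s))

# Each list element is c(full match, group 1, group 2); stack them row-wise.
m <- do.call(rbind, hits)
captures <- data.frame(key = m[, 2], value = as.numeric(m[, 3]),
                       stringsAsFactors = FALSE)
print(captures)
```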

edited Aug 24 '17 at 01:15

Artem Klevtsov

answered May 15 '13 at 11:32

jeales

Thanks for the vanilla R solution and for pointing out `regmatches` which I've never seen before – Andy Oct 14 '15 at 03:05
Why would you have to write the string twice? – Stefano Borini Oct 15 '19 at 14:13
@StefanoBorini `regexec` returns a list holding information regarding only the location of the matches, hence `regmatches` requires the user to provide the string the match list belonged to. – RTbecard Jun 15 '20 at 15:12
@andy wait until you hear about [strcapture](https://stackoverflow.com/a/45851537/1870254) – jan-glx Mar 15 '23 at 11:44

score 20 · Answer 4 · answered Apr 26 '11 at 21:43

gsub() can do this and return only the capture groups.

However, for this to work, the pattern must also match the text outside the capture groups, because, as the gsub() help notes:

(...) elements of character vectors 'x' which are not substituted will be returned unchanged.

So if your text to be selected lies in the middle of some string, adding .* before and after the capture group should allow you to only return it.

gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
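
The quoted help text also means that a string with no match passes through untouched, which is easy to mistake for a successful extraction. A small sketch of one way to guard against that, assuming NA is an acceptable marker for "no match" (the input strings are made up for illustration):

```r
pattern <- ".*\\((.*?) :: (0\\.[0-9]+)\\).*"
x <- c("prefix (sometext :: 0.1231313213) suffix", "nothing to capture")

# gsub() leaves non-matching elements untouched ...
replaced <- gsub(pattern, "\\1 \\2", x)

# ... so mark them as NA explicitly, using grepl() with the same pattern.
result <- ifelse(grepl(pattern, x), replaced, NA_character_)
print(result)
```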

score 7 · Answer 5 · answered Aug 24 '17 at 01:22

Solution with strcapture from the utils package:

x <- c("key1 :: 0.01",
       "key2 :: 0.02")
strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
           x = x,
           proto = list(key = character(), value = double()))
#>    key value
#> 1 key1  0.01
#> 2 key2  0.02
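
The same call can be applied to the parenthesised format from the question. The sketch below adds a malformed element to show that a non-matching string yields a row of NA values. It passes a zero-row data frame as `proto`, and the input vector is made up for illustration.

```r
x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)", "malformed")

res <- strcapture(
  pattern = "\\((.*?) :: (0\\.[0-9]+)\\)",
  x = x,
  # proto fixes the column names and types of the result
  proto = data.frame(key = character(), value = double(),
                     stringsAsFactors = FALSE)
)
print(res)
```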

answered Aug 24 '17 at 01:22

Artem Klevtsov

This is the right way to do stuff like this. Allows using PCRE and forces you to be explicit about expected column types & names. – jan-glx Mar 15 '23 at 11:58

ruffbytes · Answer 6 · 2015-01-29T17:45:14.927

I like Perl-compatible regular expressions. Probably someone else does too...

Here is a function that uses Perl-compatible regular expressions and matches the behaviour of the functions I am used to in other languages:

regexpr_perl <- function(expr, str) {
  # Works on a single string; returns c(full match, group 1, group 2, ...)
  match <- regexpr(expr, str, perl = TRUE)
  matches <- character(0)
  if (attr(match, 'match.length') >= 0) {
    capture_start <- attr(match, 'capture.start')
    capture_length <- attr(match, 'capture.length')
    total_matches <- 1 + length(capture_start)
    matches <- character(total_matches)
    matches[1] <- substr(str, match, match + attr(match, 'match.length') - 1)
    # seq_along() also covers patterns with exactly one (or no) capture group
    for (i in seq_along(capture_start)) {
      matches[i + 1] <- substr(str, capture_start[[i]], capture_start[[i]] + capture_length[[i]] - 1)
    }
  }
  matches
}
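
The same capture attributes are available for a whole character vector, and with named groups they carry the group names too. The following is a sketch of a vectorised variant rather than part of the answer. The group names `key` and `value` and the input vector are assumptions, and the input is assumed to contain more than one element so that vapply() returns a matrix.

```r
x <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)", "no match here")
pattern <- "\\((?<key>.*?) :: (?<value>0\\.[0-9]+)\\)"

m <- regexpr(pattern, x, perl = TRUE)
starts <- attr(m, "capture.start")   # one row per string, one column per group
lens   <- attr(m, "capture.length")
no_match <- as.vector(m) == -1

# Cut out each group column-wise; strings without a match become NA.
caps <- vapply(seq_len(ncol(starts)), function(j) {
  out <- substring(x, starts[, j], starts[, j] + lens[, j] - 1)
  out[no_match] <- NA_character_
  out
}, character(length(x)))
colnames(caps) <- attr(m, "capture.names")

res <- data.frame(key = caps[, "key"], value = as.numeric(caps[, "value"]),
                  stringsAsFactors = FALSE)
print(res)
```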

Daniel Dickison · Answer 7 · 2009-06-05T16:21:02.840

This is how I ended up working around this problem. I used two separate regexes to match the first and second capture groups and run two gregexpr calls, then pull out the matched substrings:

regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

# `str` holds the input string; lookarounds need the PCRE engine (perl = TRUE)
match.string <- gregexpr(regex.string, str, perl=TRUE)[[1]]
match.number <- gregexpr(regex.number, str, perl=TRUE)[[1]]

strings <- mapply(function (start, len) substr(str, start, start+len-1),
                  match.string,
                  attr(match.string, "match.length"))
numbers <- mapply(function (start, len) as.numeric(substr(str, start, start+len-1)),
                  match.number,
                  attr(match.number, "match.length"))
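
The mapply()/substr() step can probably be shortened with regmatches(), which extracts the matched substrings directly from a gregexpr() result. A sketch using the same two patterns, with a made-up input string named `input`:

```r
# Illustrative input containing two key-value pairs.
input <- "(sometext :: 0.1231313213) (moretext :: 0.111222)"

regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

# regmatches() cuts out the substrings that gregexpr() located.
strings <- regmatches(input, gregexpr(regex.string, input, perl = TRUE))[[1]]
numbers <- as.numeric(
  regmatches(input, gregexpr(regex.number, input, perl = TRUE))[[1]]
)
print(strings)
print(numbers)
```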

+1 for working code. However, I'd rather run a quick shell command from R and use a Bash one-liner like this `expr "xyx0.0023xyxy" : '[^0-9]*\([.0-9]\+\)'` — Aleksandr Levchuk, Sep 01 '11 at 23:18

Megatron · Answer 8 · 2018-09-11T13:30:56.860

As suggested in the stringr package, this can be achieved using either str_match() or str_extract(), or their `_all` variants (used here) when a string may contain several matches.

Adapted from the manual:

library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", 
             "239 923 8115 and 842 566 4692",
             "Work: 579-499-7527", "$1000",
             "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

Extracting and combining our groups:

str_extract_all(strings, phone, simplify=TRUE)
#      [,1]           [,2]          
# [1,] "219 733 8965" ""            
# [2,] "329-293-8753" ""            
# [3,] ""             ""            
# [4,] "239 923 8115" "842 566 4692"
# [5,] "579-499-7527" ""            
# [6,] ""             ""            
# [7,] "543.355.3679" ""

Indicating groups with an output matrix (we're interested in columns 2+):

str_match_all(strings, phone)
# [[1]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "219 733 8965" "219" "733" "8965"
# 
# [[2]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "329-293-8753" "329" "293" "8753"
# 
# [[3]]
#      [,1] [,2] [,3] [,4]
# 
# [[4]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
# 
# [[5]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "579-499-7527" "579" "499" "7527"
# 
# [[6]]
#      [,1] [,2] [,3] [,4]
# 
# [[7]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "543.355.3679" "543" "355" "3679"
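
For readers without stringr, a rough base R approximation of the per-string group matrices is to find every full match with gregexpr() and then re-run the pattern on each hit with regexec(). This is only a sketch: re-matching a hit in isolation can behave differently for patterns that rely on anchors or lookarounds, and strings with no match yield NULL here rather than an empty matrix. The shortened input vector is illustrative.

```r
strings <- c("329-293-8753 ", "banana", "239 923 8115 and 842 566 4692")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Step 1: every full match per string (character(0) when there is none).
hits <- regmatches(strings, gregexpr(phone, strings))

# Step 2: re-run the pattern on each hit to split it into its groups.
groups <- lapply(hits, function(h) {
  if (length(h) == 0) return(NULL)
  do.call(rbind, regmatches(h, regexec(phone, h)))
})
print(groups)
```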

Thanks for catching the omission. Corrected using the `_all` suffix for the relevant `stringr` functions. — Megatron, Sep 11 '18 at 13:31

score 1 · Answer 9 · answered Nov 06 '19 at 12:04

This can be done using the package unglue, taking the example from the selected answer:

# install.packages("unglue")
library(unglue)

s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
unglue_data(s, "({x} :: {y})")
#>          x            y
#> 1 sometext 0.1231313213
#> 2 moretext     0.111222

Or starting from a data frame:

df <- data.frame(col = s)
unglue_unnest(df, col, "({x} :: {y})",remove = FALSE)
#>                          col        x            y
#> 1 (sometext :: 0.1231313213) sometext 0.1231313213
#> 2     (moretext :: 0.111222) moretext     0.111222

You can get the raw regex from the unglue pattern, optionally with named capture groups:

unglue_regex("({x} :: {y})")
#>             ({x} :: {y}) 
#> "^\\((.*?) :: (.*?)\\)$"

unglue_regex("({x} :: {y})",named_capture = TRUE)
#>                     ({x} :: {y}) 
#> "^\\((?<x>.*?) :: (?<y>.*?)\\)$"

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md
