Extract a regular expression match

Question

I'm trying to extract a number from a string.

I want to apply something like [0-9]+ to the string "aaa12xxx" and get "12".

I thought it would be something like:

> grep("[0-9]+", "aaa12xxx", value=TRUE)
[1] "aaa12xxx"

And then I figured...

> sub("[0-9]+", "\\1", "aaa12xxx")
[1] "aaaxxx"

But I got some form of response doing:

> sub("[0-9]+", "ARGH!", "aaa12xxx")
[1] "aaaARGH!xxx"

There's a small detail I'm missing.

hadley · Accepted Answer · 2010-06-10T01:09:59.900

187

Use the new stringr package, which wraps all the existing regular expression operations in a consistent syntax and adds a few that are missing:

library(stringr)
str_locate("aaa12xxx", "[0-9]+")
#      start end
# [1,]     4   5
str_extract("aaa12xxx", "[0-9]+")
# [1] "12"

edited Jun 10 '10 at 01:09

answered Feb 03 '10 at 14:46


5

(almost) exactly what I needed, but as I started typing in `?str_extract` I saw `str_extract_all` and life was good again. – dwanderson Jun 22 '17 at 21:36

score 121 · Answer 2 · answered May 28 '14 at 01:44


It is probably a bit hasty to say 'ignore the standard functions' - the help file for ?gsub even specifically references in 'See also':

‘regmatches’ for extracting matched substrings based on the results of ‘regexpr’, ‘gregexpr’ and ‘regexec’.

So this will work, and is fairly simple:

txt <- "aaa12xxx"
regmatches(txt,regexpr("[0-9]+",txt))
#[1] "12"

thelatemail

2

How do you extract multiple groups? For example, extract separately 12 and 15 from the string "aaa12bbb15ccc" ? – Duccio A Aug 13 '21 at 11:52
3

@DuccioA - `regmatches(x, gregexpr("[0-9]+", x))` - like `sub` is for one replacement, and `gsub` is for all replacements, `regexpr` finds one result, while `gregexpr` finds all results. – thelatemail Aug 13 '21 at 23:43
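
As a rough sketch of the comment above, using only base R (the names `x` and `all_nums` are just for illustration and are not from the thread):

```r
x <- "aaa12bbb15ccc"

# gregexpr() locates every match in each string;
# regmatches() then returns one character vector per input string.
all_nums <- regmatches(x, gregexpr("[0-9]+", x))

all_nums[[1]]
# [1] "12" "15"
```

The result is a list because each input string can have a different number of matches.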

Marek · Answer 3 · 2021-07-09T08:31:44.037

29

For your specific case you could remove all non-digits:

gsub("[^0-9]", "", "aaa12xxxx")
# [1] "12"

It won't work in more complex cases:

gsub("[^0-9]", "", "aaa12xxxx34")
# [1] "1234"

edited Jul 09 '21 at 08:31

answered Feb 03 '10 at 14:00


Not the best option for extracting a target from a string. This just returns every digit in the string, adjacent or not, by removing all characters that are not digits, so it can mislead if you think it extracts a match (e.g., gsub("[^0-9]", "", "aaa12xx1xx") returns "121" instead of what might be expected, c(12, 1)). – daneshjai Jul 04 '21 at 14:10
@daneshjai This is exactly what OP wants. It is not a generalized solution. – Marek Jul 05 '21 at 12:10
Not necessarily. The title of the question is "Extract a regular expression match". It fits the example, but it may give the wrong impression and in some instances contrary results. So I think it helps others who land here and may be new to regex to clarify that this deletes all non-digit characters rather than extracting a pattern match. – daneshjai Jul 09 '21 at 03:25
@daneshjai Most of the answers return 12 for "aaa12xx1xx", which is not what you expect either. – Marek Jul 09 '21 at 08:33

score 17 · Answer 4 · answered Feb 03 '10 at 19:34


You can use lazy matching with Perl-style regular expressions (perl=TRUE):

> sub(".*?([0-9]+).*", "\\1", "aaa12xx99",perl=TRUE)
[1] "12"

Trying to substitute out non-digits will give the wrong result in this case ("1299").

Jyotirmoy Bhattacharya

5

You do not need perl=TRUE if you are willing to use the slightly uglier "[^0-9]*([0-9]+).*" – Jyotirmoy Bhattacharya Feb 04 '10 at 03:29
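
To illustrate why the leading part of the pattern matters, here is a small base R sketch comparing a greedy `.*` prefix with the `[^0-9]*` prefix from the comment (the names `s`, `greedy` and `first_num` are made up for this example):

```r
s <- "aaa12xx99"

# A greedy leading .* swallows as much as it can, so the capture group
# is left with only the last digit of the string.
greedy <- sub(".*([0-9]+).*", "\\1", s)
greedy
# [1] "9"

# [^0-9]* cannot run past the first digit, so the group captures
# the first run of digits instead. No perl = TRUE needed.
first_num <- sub("[^0-9]*([0-9]+).*", "\\1", s)
first_num
# [1] "12"
```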

score 6 · Answer 5 · answered Apr 20 '14 at 21:53

Use capturing parentheses in the regular expression and group references in the replacement. Anything in parentheses gets remembered. The groups are then accessed by number: \2 is the second group, here the digits. The first backslash escapes the second one in the R string so that a literal backslash gets passed to the regular expression parser.

gsub('([[:alpha:]]+)([0-9]+)([[:alpha:]]+)', '\\2', "aaa12xxx")
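
One thing to watch with this approach, shown as a small sketch (`pat`, `hit` and `miss` are illustrative names): when the pattern does not match at all, gsub returns the input unchanged instead of an empty string or NA.

```r
pat <- "([[:alpha:]]+)([0-9]+)([[:alpha:]]+)"

# The whole string matches and is replaced by the second group.
hit <- gsub(pat, "\\2", "aaa12xxx")
hit
# [1] "12"

# No digits, so nothing matches and the input comes back as-is.
miss <- gsub(pat, "\\2", "xyz")
miss
# [1] "xyz"
```

So it may be worth checking with grepl(pat, x) first if the input is not guaranteed to contain a match.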

score 5 · Answer 6 · answered Feb 03 '10 at 14:08

One way would be this:

test <- regexpr("[0-9]+","aaa12456xxx")

Now, notice regexpr gives you the starting index of the match and, as an attribute, the length of the match:

> test
[1] 4
attr(,"match.length")
[1] 5

So you can use that info with the substr function:

substr("aaa12456xxx",test,test+attr(test,"match.length")-1)

I'm sure there is a more elegant way to do this, but this was the fastest way I could find. Alternatively, you can use sub/gsub to strip out what you don't want to leave what you do want.

score 4 · Answer 7 · answered Jun 20 '16 at 13:14

One important difference between these approaches is the behaviour with any non-matches. For example, the regmatches method may not return a vector of the same length as the input if there is not a match in all positions:

> txt <- c("aaa12xxx","xyz")

> regmatches(txt,regexpr("[0-9]+",txt)) # could cause problems

[1] "12"

> gsub("[^0-9]", "", txt)

[1] "12" ""  

> str_extract(txt, "[0-9]+")

[1] "12" NA
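
If you want to stay in base R but keep one result per input, one possible workaround (only a sketch; `m` and `out` are illustrative names) is to pre-fill the result with NA and assign the matches into the positions where regexpr found something:

```r
txt <- c("aaa12xxx", "xyz")

m <- regexpr("[0-9]+", txt)            # -1 marks elements without a match

out <- rep(NA_character_, length(txt)) # one slot per input element
out[m != -1] <- regmatches(txt, m)     # fill only the matched positions

out
# [1] "12" NA
```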

score 3 · Answer 8 · answered Aug 09 '20 at 20:48


A solution for this question

library(stringr)
str_extract_all("aaa12xxx", regex("[[:digit:]]{1,}"))
# [[1]]
# [1] "12"

[[:digit:]]: digit [0-9]

{1,}: matches at least 1 time

Tho Vu

score 2 · Answer 9 · answered Jun 14 '10 at 04:20

Using strapply in the gsubfn package. strapply is like apply in that the args are object, modifier and function except that the object is a vector of strings (rather than an array) and the modifier is a regular expression (rather than a margin):

library(gsubfn)
x <- c("xy13", "ab 12 cd 34 xy")
strapply(x, "\\d+", as.numeric)
# list(13, c(12, 34))

This says to match one or more digits (\d+) in each component of x, passing each match through as.numeric. It returns a list whose components are vectors of matches of the respective components of x. Looking at the output, we see that the first component of x has one match, which is 13, and the second component of x has two matches, which are 12 and 34. See http://gsubfn.googlecode.com for more info.
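
For comparison, roughly the same result can be had without the extra package. This is a base R sketch, not how strapply works internally, and `nums` is just an illustrative name:

```r
x <- c("xy13", "ab 12 cd 34 xy")

# Find all digit runs per string, then convert each vector of matches to numeric.
nums <- lapply(regmatches(x, gregexpr("[0-9]+", x)), as.numeric)

nums[[1]]
# [1] 13
nums[[2]]
# [1] 12 34
```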

score 2 · Answer 10 · answered Oct 30 '15 at 14:47


Another solution:

temp <- regexpr('\\d+', "aaa12xxx")
# the match ends at start + length - 1
substr("aaa12xxx", temp[1], temp[1] + attr(temp, "match.length")[1] - 1)

pari

score 1 · Answer 11 · answered Nov 06 '19 at 11:16

Using the package unglue we would do the following:

# install.packages("unglue")
library(unglue)
unglue_vec(c("aaa12xxx", "aaaARGH!xxx"), "{prefix}{number=\\d+}{suffix}", var = "number")
#> [1] "12" NA

Created on 2019-11-06 by the reprex package (v0.3.0)

Use the convert argument to convert to a number automatically:

unglue_vec(
  c("aaa12xxx", "aaaARGH!xxx"), 
  "{prefix}{number=\\d+}{suffix}", 
  var = "number", 
  convert = TRUE)
#> [1] 12 NA

jan-glx · Answer 12 · 2023-07-12T15:27:34.787

0

Although you said you want to extract "12" from "aaa12xxx", it seems that you actually want the number 12. In such cases, strcapture from the preinstalled utils package is a very safe and powerful solution:

strcapture(pattern = "[^\\d]*(\\d+)[^\\d]*", x = "aaa12xxx", proto = list(my_val = integer()), perl = TRUE)
#>   my_val
#> 1     12

Created on 2023-07-12 by the reprex package (v2.0.1)

edited Jul 12 '23 at 15:27

answered Jul 12 '23 at 15:22


In contrast to `stringi`/`stringr`-based solutions, it allows you to use the more powerful `PCRE`. – jan-glx Jul 12 '23 at 15:27

score -2 · Answer 13 · 2017-05-16T19:29:41.010

You could write your regex functions with C++, compile them into a DLL and call them from R.

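#include <cstring>   // strlen, strcpy
#include <regex>
#include <string>

extern "C" {

// Sets *_bool to 1 if the whole string *first matches the regex *regexStr.
__declspec(dllexport)
void regex_match(const char **first, const char **regexStr, int *_bool)
{
    std::cmatch _cmatch;
    const char *last = *first + strlen(*first);
    std::regex rx(*regexStr);
    bool found = std::regex_match(*first, last, _cmatch, rx);
    *_bool = found;
}

// Copies up to *N successive matches of *regexStr in *str into out[0], out[1], ...
// Assumes each out[i] buffer is large enough to hold its match.
__declspec(dllexport)
void regex_search_results(const char **str, const char **regexStr, int *N, char **out)
{
    std::string s(*str);
    std::regex rgx(*regexStr);
    std::smatch m;

    int i = 0;
    while (i < *N && std::regex_search(s, m, rgx)) {
        strcpy(out[i], m[0].str().c_str());
        i++;
        s = m.suffix().str();   // continue searching after this match
    }
}

}  // extern "C"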

Call it from R as:

dyn.load("C:\\YourPath\\RegTest.dll")
regex_match <- function(str,regstr) {
.C("regex_match",x=as.character(str),y=as.character(regstr),z=as.logical(1))$z }

regex_match("abc","a(b)c")

regex_search_results <- function(x,y,n) {
.C("regex_search_results",x=as.character(x),y=as.character(y),i=as.integer(n),z=character(n))$z }

regex_search_results("aaa12aa34xxx", "[0-9]+", 5)

This is completely unnecessary. See the answers of "thelatemail" or "Robert" for an easy solution inside R. — Daniel Hoop, Sep 06 '18 at 10:46
