R Regular Expressions: How do I get the full matched string

Question

I am trying to get the full RegEx match out from R, but I can only seem to get the first portion of the string.

Using http://regexpal.com/ I can confirm that my RegEx is good and that it matches what I expect. In my data, the "error type" is found between the number preceded by an asterisk and the next comma. So I'm looking to return "*20508436572 access forbidden by rule" in the first instance and "*20508436572 some_error" in the second.

Example:

library(stringr)

regex.errortype<-'\\*\\d+\\s[^,\\n]+'
test_string1<-'2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2<-'2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'

str_extract(test_string1, regex.errortype)
str_extract_all(test_string1, regex.errortype)
regmatches(test_string1, regexpr(regex.errortype, test_string1))

str_extract(test_string2, regex.errortype)
str_extract_all(test_string2, regex.errortype)
regmatches(test_string2, regexpr(regex.errortype, test_string2))

Results:

> str_extract(test_string1, regex.errortype)
[1] "*20508436572 access forbidde"
> str_extract_all(test_string1, regex.errortype)
[[1]]
[1] "*20508436572 access forbidde"

> regmatches(test_string1, regexpr(regex.errortype, test_string1))
[1] "*20508436572 access forbidde"

> str_extract(test_string2, regex.errortype)
[1] "*20508436572 some_error"
> str_extract_all(test_string2, regex.errortype)
[[1]]
[1] "*20508436572 some_error"

> regmatches(test_string2, regexpr(regex.errortype, test_string2))
[1] "*20508436572 some_error"

As you can see, the longer match is truncated, but the shorter one is correctly parsed.

Am I missing something here, or is there some other method to get the full match back?

Cheers,

Andy.

Your regular expression excludes the letter n (as in "forbidden") rather than a newline, which I suppose is what you intended. — Jörg Mäder, Aug 08 '14 at 10:38
Just checked that by moving the "n", and this appears to be true. Do you know if that is a problem with RegEx in R, as "\n" has nothing to do with "n"? — Andy Crellin, Aug 08 '14 at 10:50
I normally use gsub for regular expressions, and there it makes no difference, but with the stringr package it depends on whether an even or odd number of backslashes is used. When I have similar problems I often use trial and error to find the right solution ;-) — Jörg Mäder, Aug 08 '14 at 11:00
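
One way to sidestep the problem discussed in these comments, sketched here with base R only, is to put an actual newline character into the character class by writing "\n" with a single backslash in the R string. The regex engine then never sees a backslash followed by n inside the brackets. The variable names regex_literal_nl and full_match are made up for this illustration.

```r
# Sample log line copied from the question.
test_string1 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'

# "\n" (single backslash) is converted to a real newline character by the
# R parser, so the class means "anything except a comma or a newline" and
# no backslash escape has to be interpreted inside [...].
regex_literal_nl <- "\\*\\d+\\s[^,\n]+"

full_match <- regmatches(test_string1, regexpr(regex_literal_nl, test_string1))
print(full_match)
```

If the data can never contain embedded newlines, dropping the newline from the class altogether ('[^,]+') should behave the same way.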

score 2 · Accepted Answer · answered Aug 08 '14 at 10:34

str_extract_all(test_string1, perl("(?<=\\#[0-9]\\: )\\*\\d+\\s[^,\\n]+"))[[1]]
#[1] "*20508436572 access forbidden by rule"

str_extract_all(test_string2, perl("(?<=\\#[0-9]\\: )\\*\\d+\\s[^,\\n]+"))[[1]]
#[1] "*20508436572 some_error"

Using a lookbehind:

(?<= ... ) requires the match to be preceded by:

\\# a # character

[0-9] followed by a single digit

\\:  followed by a : and a space

Then comes your original pattern.
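
As a hedged aside, the same idea can be sketched in base R with perl = TRUE, which here stands in for the stringr perl() wrapper used above (I believe that wrapper was deprecated in later stringr releases). Running the pattern with and without the lookbehind suggests that the Perl-compatible engine, which reads \n inside a character class as a newline, is what restores the full match, not the lookbehind itself. The variable names are made up for this illustration.

```r
# Sample log line copied from the question.
test_string1 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'

# Assumption: base R's perl = TRUE replaces the stringr perl() wrapper.
with_lookbehind    <- "(?<=#[0-9]: )\\*\\d+\\s[^,\\n]+"
without_lookbehind <- "\\*\\d+\\s[^,\\n]+"

# Under PCRE, "\\n" inside [...] is an escape for the newline character.
m_with    <- regmatches(test_string1, regexpr(with_lookbehind, test_string1, perl = TRUE))
m_without <- regmatches(test_string1, regexpr(without_lookbehind, test_string1, perl = TRUE))

print(m_with)
print(m_without)
```

Since the lookbehind is not needed for the fix, the second pattern should also be usable on lines where "#<digit>: " does not precede the asterisk.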

akrun

Great - worked a treat. What is the function of the "(?<=\\#[0-9]\\: )" string in this example? – Andy Crellin Aug 08 '14 at 10:42
I'm not sure why applying that lookbehind returned the full string - I need to apply this to other parts of my data where that lookbehind won't be valid. – Andy Crellin Aug 08 '14 at 10:46
@Andy Crellin. It is regex Look behind. Check this link – akrun Aug 08 '14 at 10:51
@Andy Crellin. Without the `\\n`, I am getting the expected output. `str_extract_all(test_string1, '\\*\\d+\\s[^,]+')` – akrun Aug 08 '14 at 10:58
akrun, @Jörg Mäder: Thanks for both your comments. Simply replacing \n with [:cntrl:] seems to have solved the problem as well. – Andy Crellin Aug 08 '14 at 11:11
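
The [:cntrl:] variant mentioned in the last comment can be sketched as follows, assuming the POSIX class was placed inside the existing negated bracket expression. Control characters include the newline, so the class excludes commas and line breaks without any backslash escape inside the brackets. The names regex_cntrl and error_types are made up for this illustration.

```r
# Sample log lines copied from the question.
test_string1 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'

# [^,[:cntrl:]] = any character that is neither a comma nor a control
# character (newline, tab, ...).
regex_cntrl <- "\\*\\d+\\s[^,[:cntrl:]]+"

log_lines <- c(test_string1, test_string2)
error_types <- regmatches(log_lines, regexpr(regex_cntrl, log_lines))
print(error_types)
```

Note that this also stops the match at tabs and other control characters, which is slightly stricter than excluding only the newline.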

Rich Scriven · Answer 2 · 2014-08-08T10:54:44.677

Here's a gsub method that returns your desired string in both cases by deleting everything around it, without rewriting your regular expression. Note that the leading * is removed along with the text before it.

> gsub("((.*)[*])|([,](.*))", "", c(test_string1, test_string2))
# [1] "20508436572 access forbidden by rule" 
# [2] "20508436572 some_error"

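In the regular expression ((.*)[*])|([,](.*)),

((.*)[*]) matches everything up to and including the * character.
| means "or".
([,](.*)) matches the first comma and everything after it.

gsub replaces both matched parts with the empty string, leaving only the text in between.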
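
If the leading asterisk should be kept, as in the output the question asks for, one possible variation is to capture the wanted part and substitute the whole line with that capture. This is only a sketch, not part of the answer above; perl = TRUE is an assumption made here so that the greedy matching behaviour is unambiguous, and error_types is a made-up name.

```r
# Sample log lines copied from the question.
test_string1 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'

# Capture "*<digits> <text up to the next comma>" and replace the whole
# line with just that captured group (\\1).
error_types <- sub(".*(\\*\\d+\\s[^,]+).*", "\\1",
                   c(test_string1, test_string2), perl = TRUE)
print(error_types)
```

Unlike regmatches, sub returns a line unchanged when the pattern does not match, so non-matching lines need to be filtered separately.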
