
I am trying to get the full RegEx match out from R, but I can only seem to get the first portion of the string.

Using http://regexpal.com/ I can confirm that my RegEx is good and that it matches what I expect. In my data, the "error type" is found between the number preceded by an asterisk and the next comma. So I'm looking to return "*20508436572 access forbidden by rule" in the first instance and "*20508436572 some_error" in the second.

Example:

library(stringr)

regex.errortype<-'\\*\\d+\\s[^,\\n]+'
test_string1<-'2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2<-'2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'

str_extract(test_string1, regex.errortype)
str_extract_all(test_string1, regex.errortype)
regmatches(test_string1, regexpr(regex.errortype, test_string1))

str_extract(test_string2, regex.errortype)
str_extract_all(test_string2, regex.errortype)
regmatches(test_string2, regexpr(regex.errortype, test_string2))

Results:

> str_extract(test_string1, regex.errortype)
[1] "*20508436572 access forbidde"
> str_extract_all(test_string1, regex.errortype)
[[1]]
[1] "*20508436572 access forbidde"

> regmatches(test_string1, regexpr(regex.errortype, test_string1))
[1] "*20508436572 access forbidde"

> str_extract(test_string2, regex.errortype)
[1] "*20508436572 some_error"
> str_extract_all(test_string2, regex.errortype)
[[1]]
[1] "*20508436572 some_error"

> regmatches(test_string2, regexpr(regex.errortype, test_string2))
[1] "*20508436572 some_error"

As you can see, the longer match is truncated, but the shorter one is correctly parsed.

Am I missing something here, or is there some other method to get the full match back?

Cheers,

Andy.

oguz ismail
  • Your regular expression is excluding the letter n in "forbidden", not a newline as I suppose you intended. – Jörg Mäder Aug 08 '14 at 10:38
  • Just looked at that by moving the "n" and this appears to be true. Do you know if that is a problem with RegEx in R, as "\n" has nothing to do with "n"? – Andy Crellin Aug 08 '14 at 10:50
  • I normally use gsub for regexps, and there it makes no difference, but with the stringr package it depends on whether an even or odd number of backslashes is used. If I have similar problems I often use trial and error to find the right solution ;-) – Jörg Mäder Aug 08 '14 at 11:00
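
A minimal sketch of the escaping levels discussed in these comments, using base R only. It assumes the default (non-PCRE) engine used by regexpr(); the variable names x and pattern_literal_newline are made up for illustration. Writing '\n' with a single backslash puts a real newline character into the pattern, so the bracket expression no longer depends on how the regex engine treats a backslash inside [...].

```r
# In R source, '\\n' is two characters (a backslash and the letter n),
# while '\n' is a single newline character.
print(nchar('\\n'))
print(nchar('\n'))

x <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'

# Same pattern as in the question, but with a literal newline inside the
# negated bracket expression instead of the two-character escape.
pattern_literal_newline <- '\\*\\d+\\s[^,\n]+'

full_match <- regmatches(x, regexpr(pattern_literal_newline, x))
print(full_match)
```

With this pattern the match should run all the way to the comma rather than stopping at the first n.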

2 Answers

 str_extract_all(test_string1, perl("(?<=\\#[0-9]\\: )\\*\\d+\\s[^,\\n]+"))[[1]]
#[1] "*20508436572 access forbidden by rule"

str_extract_all(test_string2, perl("(?<=\\#[0-9]\\: )\\*\\d+\\s[^,\\n]+"))[[1]]
#[1] "*20508436572 some_error"

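Using a lookbehind:

(?<= ... ) requires the match to be immediately preceded by the enclosed pattern, without including it in the result

\\# matches a literal #

[0-9] followed by a single digit

\\: followed by : and a space

The rest is your original pattern.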
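
For comparison, here is a rough base R sketch of the same idea, assuming perl = TRUE (PCRE) in regexpr() in place of stringr's perl() wrapper; the variable names are hypothetical. It also runs the question's original pattern under PCRE, which suggests that the Perl-style engine, where \n inside a bracket expression means a newline, may be what restores the full match rather than the lookbehind itself.

```r
x <- c(
  '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222',
  '2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'
)

# Lookbehind version: the match must be preceded by '#', one digit, ':' and a space.
with_lookbehind <- '(?<=#[0-9]: )\\*\\d+\\s[^,\\n]+'
res_lookbehind <- regmatches(x, regexpr(with_lookbehind, x, perl = TRUE))

# The question's original pattern, unchanged, but evaluated by PCRE.
original_pattern <- '\\*\\d+\\s[^,\\n]+'
res_original <- regmatches(x, regexpr(original_pattern, x, perl = TRUE))

print(res_lookbehind)
print(res_original)
```

If both calls return the same strings, the lookbehind can be dropped for data where the "#0: " prefix is not present.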

akrun
  • Great - worked a treat. What is the function of the "(?<=\\#[0-9]\\: )" string in this example? – Andy Crellin Aug 08 '14 at 10:42
  • I'm not sure why applying that lookbehind returned the full string - I need to apply this to other parts of my data where that lookbehind won't be valid. – Andy Crellin Aug 08 '14 at 10:46
  • @Andy Crellin. It is a regex lookbehind. Check this link – akrun Aug 08 '14 at 10:51
  • @Andy Crellin. Without the `\\n`, I am getting the expected output. `str_extract_all(test_string1, '\\*\\d+\\s[^,]+')` – akrun Aug 08 '14 at 10:58
  • akrun, @Jörg Mäder: Thanks for both your comments. Simply replacing \n with [:cntrl:] seems to have solved the problem as well. – Andy Crellin Aug 08 '14 at 11:11

Here's a gsub method that extracts your desired string (minus the leading asterisk) in both cases by removing everything around it, without rewriting your regular expression.

> gsub("((.*)[*])|([,](.*))", "", c(test_string1, test_string2))
# [1] "20508436572 access forbidden by rule" 
# [2] "20508436572 some_error"   

In the regular expression ((.*)[*])|([,](.*)),

  • ((.*)[*]) matches everything up to and including the * character.
  • | means "or"
  • ([,](.*)) matches the first comma and everything after it. Both matched parts are replaced with the empty string.
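
As a sketch of how the two alternatives behave, the snippet below reruns the gsub call on both test strings and adds one possible variant that keeps the leading asterisk by capturing the wanted piece with sub(). The variant is an assumption of mine, not part of the original answer, and the names without_star and with_star are illustrative only.

```r
x <- c(
  '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222',
  '2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'
)

# Delete everything up to and including '*', and everything from the first comma on.
without_star <- gsub("((.*)[*])|([,](.*))", "", x)

# Variant: capture "*<digits> <text up to the comma>" and replace the whole
# string with that captured group, so the asterisk is kept.
with_star <- sub(".*(\\*\\d+\\s[^,]+).*", "\\1", x, perl = TRUE)

print(without_star)
print(with_star)
```

The capture-group form may be easier to adapt when other parts of the line also contain commas or asterisks, since it states what to keep instead of what to strip.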
Rich Scriven