lookbehind in str_extract with R

Question

I have the following text file:

[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:43:00, 10.100.120.120, unknown]: spatial_monitor: Kurt entered Conference Room (Computer desk contains Person role)
[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)
[01/29/14 16:43:03, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:43:08, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:46:07, 10.100.120.120, unknown]: spatial_monitor: Fred entered Conference Room (Zone Role contains Person role)
[01/29/14 16:46:08, 10.100.120.120, unknown]: spatial_monitor: Fred left Conference Room (Zone Role contains Person role)

I am trying to use str_extract in R (from the stringr library) to extract the names of locations ("Conference Room" in the example above). The logic is to pull the portion of the string that follows the word "entered" or "left". To this end, I have the following regular expression:

(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+

This works fine in Notepad++; however, when I embed it in R, I get the following error:

> tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
> str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+')
Error in regexpr("(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+", "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",  : 
  invalid regular expression '(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+', reason 'Invalid regexp'

Other answers tell me that lookahead and lookbehind only work with Perl. So the question is how to enable Perl with str_extract? Or is there a better way of doing this? Thanks in advance.

This works and does not use lookahead/lookbehind. Parenthesize the portion to be extracted as shown: `library(gsubfn); strapplyc(tt, 'entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+)', simplify = TRUE)` — G. Grothendieck, Feb 06 '14 at 15:55

score 4 · Accepted Answer · edited May 06 '21 at 13:20

library(stringr)
tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
str_extract(tt, perl('(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+'))
# [1] "Conference Room"

Update: perl() was removed in stringr 1.3.0 (released 2018-02-19). You can now pass the pattern directly: str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+').
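
The question asks for the text after either "entered" or "left", so the lookbehind can list both alternatives. This is a minimal sketch, assuming a stringr version recent enough that perl() is no longer needed; `log_lines` is illustrative sample data copied from the question, and `location_pattern` is a name introduced here for the example.

```r
library(stringr)

# Two sample lines from the question's log file (illustrative data).
log_lines <- c(
  "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",
  "[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)"
)

# Lookbehind with two alternatives: match two capitalized words
# that directly follow "entered " or "left ".
location_pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# str_extract is vectorized: one result per input line, NA where nothing matches.
locations <- str_extract(log_lines, location_pattern)
print(locations)
```

Because str_extract returns NA for lines without a match, the result stays aligned with the input, which helps if the locations are later bound to other columns parsed from the same lines.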

answered Feb 06 '14 at 15:39 by lukeA · edited May 06 '21 at 13:20 by CubicInfinity

score 3 · Answer 2 · answered Feb 06 '14 at 15:37

Your regex is valid; R's default regex engine simply does not support lookbehind. Base R functions such as sub accept it if you specify perl = TRUE:

sub('.*(?<=entered\\s)([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt, perl = TRUE)
# [1] "Conference Room"

Alternatively, without perl:

sub('.*entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt)
# [1] "Conference Room"
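
If you would rather extract the match than substitute around it, base R can do that too. The following is a sketch rather than part of the original answer: it combines regexpr with perl = TRUE and regmatches, and extends the lookbehind to cover both "entered" and "left". The `log_lines` vector is sample data from the question, and `location_pattern` is a name introduced for illustration.

```r
# Two sample lines from the question's log file (illustrative data).
log_lines <- c(
  "[01/29/14 16:43:00, 10.100.120.120, unknown]: spatial_monitor: Kurt entered Conference Room (Computer desk contains Person role)",
  "[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)"
)

# The alternatives sit at the top level of the lookbehind, so each branch
# has its own fixed length, which the Perl-compatible engine requires.
location_pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# regexpr finds the first match per element; perl = TRUE enables lookbehind.
match_positions <- regexpr(location_pattern, log_lines, perl = TRUE)

# regmatches pulls out the matched substrings.
locations <- regmatches(log_lines, match_positions)
print(locations)
```

Note that regmatches drops elements with no match instead of returning NA, so the result can be shorter than the input when some lines do not contain a location.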
