
I have the following text file:

[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:43:00, 10.100.120.120, unknown]: spatial_monitor: Kurt entered Conference Room (Computer desk contains Person role)
[01/29/14 16:43:02, 10.100.120.120, unknown]: spatial_monitor: Kurt left Conference Room (Computer desk contains Person role)
[01/29/14 16:43:03, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)
[01/29/14 16:43:08, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)
[01/29/14 16:46:07, 10.100.120.120, unknown]: spatial_monitor: Fred entered Conference Room (Zone Role contains Person role)
[01/29/14 16:46:08, 10.100.120.120, unknown]: spatial_monitor: Fred left Conference Room (Zone Role contains Person role)

I am trying to use str_extract in R (from the stringr library) to extract the names of locations ("Conference Room" in the example above). The logic is to pull the portion of the string that follows the word "entered" or "left". To this end, I have the following regular expression:

(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+

This works fine in Notepad++; however, when I embed it in R, I get the following error:

> tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
> str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+')
Error in regexpr("(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+", "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",  : 
  invalid regular expression '(?<=entered\s)[A-Z][a-z]+\s[A-Z][a-z]+', reason 'Invalid regexp'

Other answers tell me that lookahead and lookbehind only work with Perl-style regular expressions. So the question is: how do I enable Perl with str_extract? Or is there a better way of doing this? Thanks in advance.

Bala Deshpande
    This works and does not use lookahead/lookbehind. Parenthesize the portion to be extracted as shown: `library(gsubfn); strapplyc(tt, 'entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+)', simplify = TRUE)` – G. Grothendieck Feb 06 '14 at 15:55

2 Answers

library(stringr)
tt <- "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)"
str_extract(tt, perl('(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+'))
# [1] "Conference Room"

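Update: perl() was removed in stringr 1.3.0 (2018-02-19). You can now simply call str_extract(tt, '(?<=entered\\s)[A-Z][a-z]+\\s[A-Z][a-z]+'), because current versions of stringr use the ICU regex engine (via stringi), which supports lookbehind.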
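The question asks for the text after either "entered" or "left", while the pattern above only covers "entered". A minimal sketch of one way to cover both, assuming stringr 1.3.0 or later is installed; the vector name log_lines is made up for illustration, and its two elements are copied from the sample log in the question:

```r
library(stringr)

# Two sample lines from the question's log (one "entered", one "left").
log_lines <- c(
  "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",
  "[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)"
)

# Lookbehind with two alternatives; each alternative has a fixed length.
pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# str_extract is vectorised: it returns one result per input line.
rooms <- str_extract(log_lines, pattern)
rooms
# [1] "Conference Room" "Conference Room"
```

Note that this pattern still assumes every location name is exactly two capitalised words, as in the sample data.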

CubicInfinity
lukeA

Your regex is valid, but lookbehind is only supported by the Perl-compatible engine, so it works in base R functions if you specify perl = TRUE. For example, you can use sub for your task:

sub('.*(?<=entered\\s)([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt, perl = TRUE)
# [1] "Conference Room"

Alternatively, without perl = TRUE, drop the lookbehind and match "entered" directly:

sub('.*entered\\s([A-Z][a-z]+\\s[A-Z][a-z]+).*', '\\1', tt)
# [1] "Conference Room"
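One caveat with sub is that it returns the input string unchanged when the pattern does not match. If that matters, a base R alternative is regexpr combined with regmatches, which keeps only the matched text. The following is a rough sketch rather than a definitive solution: log_lines is a made-up name, the third element is a hypothetical non-matching line added for illustration, and the pattern is extended to cover "left" as well:

```r
# Two lines from the question's log plus one hypothetical non-matching line.
log_lines <- c(
  "[01/29/14 16:42:55, 10.100.120.120, unknown]: spatial_monitor: Alan entered Conference Room (Zone Role contains Person role)",
  "[01/29/14 16:42:57, 10.100.120.120, unknown]: spatial_monitor: Alan left Conference Room (Zone Role contains Person role)",
  "a line with no location event"  # hypothetical, not from the original log
)

# Lookbehind needs the Perl-compatible engine, hence perl = TRUE below.
pattern <- "(?<=entered\\s|left\\s)[A-Z][a-z]+\\s[A-Z][a-z]+"

# regexpr finds the first match per line (-1 where there is none);
# regmatches extracts the matched text and drops lines without a match.
m <- regexpr(pattern, log_lines, perl = TRUE)
rooms <- regmatches(log_lines, m)
rooms
# [1] "Conference Room" "Conference Room"
```

Because non-matching lines are dropped, the result can be shorter than the input; use which(m > 0) if you need to know which lines matched.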
Sven Hohenstein