
I need to split on words and end marks (punctuation of certain types). Oddly, the pipe ("|") can count as an end mark. I have code that works on end marks until I try to add the pipe. Adding the pipe makes strsplit split on every character. Escaping it causes an error. How can I include the pipe in the regular expression?

x <- "I like the dog|."

strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE)
#[[1]]
#[1] "I"    "like" "the"  "dog|" "."   

strsplit(x, "[[:space:]]|(?=[.!?*-\|])", perl=TRUE)
#Error: '\|' is an unrecognized escape in character string starting "[[:space:]]|(?=[.!?*-\|"

The outcome I'd like:

#[[1]]
#[1] "I"    "like" "the"  "dog"  "|"  "."  #pipe is an element
Tyler Rinker
  • I am always hesitant to put regex tags on R regex questions because you get regexers from other languages and though the answers are similar they don't overlap. – Tyler Rinker Oct 17 '12 at 19:01

2 Answers


One way to solve this is to use the \Q...\E notation to remove the special meaning of any of the characters in .... As it says in ?regex:

If you want to remove the special meaning from a sequence of characters, you can do so by putting them between ‘\Q’ and ‘\E’. This is different from Perl in that ‘$’ and ‘@’ are handled as literals in ‘\Q...\E’ sequences in PCRE, whereas in Perl, ‘$’ and ‘@’ cause variable interpolation.

For example:

> strsplit(x, "[[:space:]]|(?=[\\Q.!?*-|\\E])", perl=TRUE)
[[1]]
[1] "I"    "like" "the"  "dog"  "|"    "."
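
As a side note on the error in the question: it comes from R's string parser rather than the regex engine, because "\|" is not a valid escape in an R string literal. A minimal sketch, assuming the same x as in the question (the names quoted and escaped are only illustrative), compares the \Q...\E form with a class that uses a doubled backslash and keeps the hyphen last:

```r
x <- "I like the dog|."

# \Q...\E makes every character between the markers literal
quoted <- strsplit(x, "[[:space:]]|(?=[\\Q.!?*-|\\E])", perl = TRUE)[[1]]

# "\\|" in R source is the regex \| (a literal pipe); a single "\|"
# is rejected when R parses the string, before any regex is compiled.
# The hyphen is kept last so that it cannot form a range.
escaped <- strsplit(x, "[[:space:]]|(?=[.!?*\\|-])", perl = TRUE)[[1]]

identical(quoted, escaped)
```

Both patterns describe the same set of end marks, so the comparison is expected to return TRUE.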
Joshua Ulrich

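The problem is actually your hyphen. Inside a character class, a hyphen between two characters defines a range, so *-| matches every character from * to |, which includes all digits and letters. The pipe itself needs no escaping inside a class. Put the hyphen either first or last so that it is taken literally: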

strsplit(x, "[[:space:]]|(?=[|.!?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[.|!?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[.!|?*-])", perl=TRUE)
strsplit(x, "[[:space:]]|(?=[-|.!?*])", perl=TRUE)

and so on should all give you the output you are looking for.

You can also escape the hyphen if you prefer, but remember to use two backslashes!

strsplit(x, "[[:space:]]|(?=[.!?*\\-|])", perl=TRUE)
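
To see the range effect directly, here is a small sketch (the vector chars and the names bad and good are made up for illustration) that tests a few single characters against each class:

```r
# A handful of single characters to test against each class
chars <- c("a", "Z", "5", "|", "-", ".", " ")

# Hyphen between '*' and '|' forms a range (ASCII 42 to 124),
# which covers digits, letters, and most punctuation
bad <- grepl("[.!?*-|]", chars, perl = TRUE)

# Hyphen last is a literal hyphen; '|' is literal inside a class
good <- grepl("[.!?*|-]", chars, perl = TRUE)

setNames(bad, chars)   # everything except the space matches
setNames(good, chars)  # only "|", "-" and "." match
```

With the range in place, the lookahead succeeds before almost every character, which is why strsplit breaks the string apart one character at a time.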
A5C1D2H2I1M1N2O1R2T1