
I have asked related questions HERE and HERE. I tried to generalize those answers but failed.

I have a string that I want to split into words, numbers, and any sort of punctuation, but I want to keep apostrophes inside their words. Here is what I've tried; I think I'm close:

x <- "Raptors don't like robots! I'd pay $500.00 to rid them."

strsplit(x, "(\\s+)|(?=[[:punct:]])", perl = TRUE)

## [[1]]
##  [1] "Raptors" "don"     "'"       "t"       "like"    "robots"  "!"
##  [8] ""        "I"       "'"       "d"       "pay"     "$"       "500"
## [15] "."       "00"      "to"      "rid"     "them"    "."

Here's what I'm after:

## [[1]]
##  [1] "Raptors" "don't"   "like"    "robots"  "!"       ""        "I'd"
##  [8] "pay"     "$"       "500"     "."       "00"      "to"      "rid"
## [15] "them"    "."

Although I want a base R solution, I would also like to see other approaches (I'm sure someone has a stringr solution), which would make the question more useful to others.

Note: R has a specific regex system. You will want to be familiar with R to answer this question.

Tyler Rinker
  • (Curious) What's specific about R's regex flavour? – Jongware Mar 06 '14 at 21:05
  • I'm confused how the first question you link to is not exactly the same as this one? – eddi Mar 06 '14 at 21:19
  • @Jongware, there are issues with escaping special characters for example. – Tyler Rinker Mar 06 '14 at 21:31
  • @eddi The first question removed the characters, here I'm not removing them, I want them. I used info from those 2 questions to get me as far as I can (similar but not identical). – Tyler Rinker Mar 06 '14 at 21:31
  • @TylerRinker can you illustrate with an example? For your current example: `identical(strsplit(x, "[[:space:]]|(?=[^'[:^punct:]])", perl=TRUE), strsplit(x, "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)) # [1] TRUE` – eddi Mar 06 '14 at 21:36
  • @eddi You are correct. I was testing with a comma, which is why I thought the two situations were different. They are not, because the other solution explicitly avoided splitting on commas. I voted to close. – Tyler Rinker Mar 06 '14 at 21:45

1 Answer


You could add a negative lookahead, (?!'), so that no split happens directly before an apostrophe:

strsplit(x, "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)
#  [1] "Raptors" "don't"   "like"    "robots"  "!"       ""        "I'd"     "pay"     "$"       "500"     "."       "00"      "to"      "rid"     "them"    "."
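
For reuse, the pattern above could be wrapped in a small function. This is only a sketch: the name split_keep_apostrophes and the drop_empty argument are hypothetical and not part of the original answer. The drop_empty option assumes you may not want the "" element that appears in the result after "!".

```r
# Hypothetical helper (name and arguments are illustrative only).
split_keep_apostrophes <- function(text, drop_empty = FALSE) {
  # Split on runs of whitespace, or at the zero-width position in front of
  # any punctuation character that is not an apostrophe.
  pattern <- "(\\s+)|(?!')(?=[[:punct:]])"
  tokens <- strsplit(text, pattern, perl = TRUE)[[1]]
  # Optionally drop the empty strings produced when whitespace directly
  # follows a punctuation mark that was split off on its own.
  if (drop_empty) {
    tokens <- tokens[nzchar(tokens)]
  }
  tokens
}

x <- "Raptors don't like robots! I'd pay $500.00 to rid them."
print(split_keep_apostrophes(x))
print(split_keep_apostrophes(x, drop_empty = TRUE))
```

With drop_empty = FALSE the function returns the same vector as the strsplit call above; with drop_empty = TRUE the "" element is filtered out with nzchar().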
sgibb