I came across this question: "PHP explode the string, but treat words in quotes as a single word", along with similar ones about using a regex to split a sentence on spaces while keeping quoted text intact (as a single word).

I would like to do the same in R. I tried pasting the regular expression into stri_split in the stringi package as well as strsplit in base R, but I suspect the regular expression uses a format R does not recognize. The error is:

Error: '\S' is an unrecognized escape in character string...

The desired output would be:

mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'

myfoo(mystr)

[1] "preceded by itself in quotation marks forms a complete sentence" "preceded" "by" "itself" "in" "quotation" "marks" "forms" "a" "complete" "sentence"

Trying: strsplit(mystr, '/"(?:\\\\.|(?!").)*%22|\\S+/') gives:

Error in strsplit(mystr, "/\"(?:\\\\.|(?!\").)*%22|\\S+/") : 
  invalid regular expression '/"(?:\\.|(?!").)*%22|\S+/', reason 'Invalid regexp'
moodymudskipper
AdamO
  • Try `\\S` (as for any backslash character for a regex in R) – Tensibai Dec 13 '17 at 15:47
  • Where is the regular expression that you tried? Why is your code not in this question? – MrFlick Dec 13 '17 at 15:48
  • [`"(?:[^"\\]|\\.)*"|\S+`](https://regex101.com/r/rsbF5i/1)? *Also works with escaped double quotes* – ctwheels Dec 13 '17 at 15:51
  • @ctwheels, your suggested regex provides *exactly the same error* as the OP. It also might be remedied by Tensibai's comment about backslashes. – r2evans Dec 13 '17 at 17:37
  • ashkan, the regex you provided requires perl-like expressions, so if you use `strsplit(..., perl=TRUE)` then you no longer get the error. (It doesn't correctly parse the string the way you want, but it doesn't provide an error.) – r2evans Dec 13 '17 at 17:39
  • @r2evans the OP updated the post. Doesn't mean my answer is incorrect, just means I wasn't presented with all the information to correctly answer the question. My answer does work and does not provide the *exact same error as the OP* assuming the OP properly escapes my regex and uses `perl=TRUE` – ctwheels Dec 13 '17 at 17:39
  • Your suggested regex errors *exactly as the [original question](https://stackoverflow.com/revisions/92d0f5f1-013e-4e40-bdb3-16f853459ed8/view-source)*. The correction is simple: adding the second backslash removes the error. R is non-standard in its use of double-backslashes, certainly. All I was saying was that you had not tested your suggestion in R *and it does not fix the error*. – r2evans Dec 13 '17 at 17:44
  • After the edit it appears to be a well-formed question with an example and desired result. – IRTFM Dec 13 '17 at 18:09
  • @r2evans [I did test my answer in R and it does work](https://ideone.com/fyv37j). In any case, I think A5C1D2H2I1M1N2O1R2T1's answer is much nicer – ctwheels Dec 13 '17 at 18:24
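
For reference, the escaping fix discussed in the comments can be sketched roughly as follows. This is only an illustration: it takes ctwheels' pattern, doubles every backslash for the R string literal, and extracts matches with `regmatches()`/`gregexpr()` and `perl = TRUE` instead of splitting. The variable names `pattern` and `tokens` and the final quote-stripping step are assumptions added here, not part of any comment.

```r
mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'

# The regex "(?:[^"\\]|\\.)*"|\S+ written as an R string:
# every regex backslash is doubled, so \S becomes \\S and \\ becomes \\\\.
pattern <- '"(?:[^"\\\\]|\\\\.)*"|\\S+'

# Extract the tokens (a quoted run or a run of non-space characters)
# rather than splitting on a delimiter.
tokens <- regmatches(mystr, gregexpr(pattern, mystr, perl = TRUE))[[1]]

# Assumption: the surrounding double quotes are not wanted in the result.
tokens <- gsub('^"|"$', "", tokens)

length(tokens)  # 11
tokens[1]       # "preceded by itself in quotation marks forms a complete sentence"
```

Matching tokens sidesteps `strsplit()`, which consumes whatever the pattern matches and so would discard the words themselves.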

1 Answer

A simple option is scan, which splits on whitespace and keeps quoted text together as a single item:

> x <- scan(what = "", text = mystr)
Read 11 items
> x
 [1] "preceded by itself in quotation marks forms a complete sentence"
 [2] "preceded"                                                       
 [3] "by"                                                             
 [4] "itself"                                                         
 [5] "in"                                                             
 [6] "quotation"                                                      
 [7] "marks"                                                          
 [8] "forms"                                                          
 [9] "a"                                                              
[10] "complete"                                                       
[11] "sentence"  
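
If you want this as a function like the `myfoo` in the question, a minimal sketch could look like the one below. The parameter name `x` is an assumption. By default `scan` treats both single and double quotes as quoting characters, so the sketch passes `quote = '"'` on the assumption that only double quotes should group words (otherwise apostrophes may be read as quote marks), and `quiet = TRUE` to suppress the "Read 11 items" message.

```r
# Hypothetical wrapper around scan(); `myfoo` is the name used in the question.
myfoo <- function(x) {
  scan(
    text  = x,     # read from the string instead of a file
    what  = "",    # return a character vector
    quote = '"',   # only double quotes group words (assumption)
    quiet = TRUE   # do not print "Read N items"
  )
}

mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'
res <- myfoo(mystr)
length(res)  # 11

# Apostrophes stay part of the word because only '"' is a quote character here.
myfoo('say "hello world" don\'t stop')
```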
A5C1D2H2I1M1N2O1R2T1