strsplit inconsistent with gregexpr

Question

A comment on my answer to this question points out that the strsplit call I suggested does not give the desired result, even though the regex appears to match only the first and last commas in a character vector. This can be verified using gregexpr and regmatches.

So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

Huh?! What is going on?

my wild hypothesis is that `strsplit` works iteratively/recursively on the remainder of the string after each split - i.e. `"123,34,56,78,90"` -> `["123", "34,56,78,90"]` -> `["123", "34", "56,78,90"]`?? *i know nothing about r, but it's a testable hypothesis until you have a better one :) or it can always be just a plain simple bug in the implementation, you could try with different versions of r..* — Aprillion, May 31 '14 at 11:34
@deathApril I think it's actually to do with global vs. non-global replacement. I assume that `strsplit` performs non-global splitting at the first match, as you get the same result using `regexpr` instead of `gregexpr`. I wonder if it is possible to make `strsplit` do global matching...? — Simon O'Hanlon, May 31 '14 at 11:36
The expression: `'/^\w+\K,|,(?=\w+$)/'` splits the string correctly using 'preg_split()' in PHP. Looks like this may indicate an issue with the r implementation of PCRE. — ridgerunner, May 31 '14 at 12:30
Not sure, but I think this might have something to do with [this Q/A](http://stackoverflow.com/q/15575221/559784) from Josh. — Arun, May 31 '14 at 12:59

score 10 · Accepted Answer · edited May 23 '17 at 12:20

@Aprillion's theory is correct. From the R documentation for strsplit:

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}

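In other words, at each iteration ^ matches the beginning of the remaining string, i.e. the input with everything up to and including the previous match removed. With your regex, the first alternative ^\w+\K, therefore matches the first comma of each successive remainder, which is why every comma ends up as a split point.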
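
The documented loop can be sketched in R roughly as follows. This is only an illustration under stated assumptions: `strsplit_sketch` is a hypothetical helper written for this answer, not R's actual implementation, and it assumes every match has non-zero length (the real `strsplit` treats empty matches specially).

```r
# Hypothetical helper: mimics the documented strsplit() loop for ONE string.
# Assumption: the pattern never produces a zero-length match.
strsplit_sketch <- function(s, pattern) {
  out <- character(0)
  repeat {
    if (nchar(s) == 0L) break                     # string is empty
    m   <- regexpr(pattern, s, perl = TRUE)       # first match in the remainder
    pos <- as.integer(m)
    len <- attr(m, "match.length")
    if (pos != -1L) {
      if (len == 0L) stop("zero-length match not handled in this sketch")
      # add the string to the left of the match to the output
      out <- c(out, substr(s, 1L, pos - 1L))
      # remove the match and all to the left of it
      s <- substr(s, pos + len, nchar(s))
    } else {
      out <- c(out, s)                            # no match: keep the remainder
      break
    }
  }
  out
}

strsplit_sketch("123,34,56,78,90", "^\\w+\\K,|,(?=\\w+$)")
#[1] "123" "34"  "56"  "78"  "90"
```

Because `regexpr` is applied to the shortened remainder each time, `^` anchors to a new position on every pass, which reproduces the "split on every comma" result from the question.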

To simply illustrate this behavior:

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""

Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to @JoshO'Brien for the link.)
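
If the goal is still to split only on the first and last commas, one possible workaround (a sketch, not the only way) is to skip `strsplit` and reuse the `gregexpr` match data from the question, since it is computed once against the whole string. `regmatches` with `invert = TRUE` returns the pieces between the matches; the variable name `pieces` is just for illustration.

```r
x <- "123,34,56,78,90"

# Match data computed against the full string: matches at positions 4 and 13
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)

# invert = TRUE extracts the non-matched substrings, i.e. the split pieces
pieces <- regmatches(x, m, invert = TRUE)[[1]]
pieces
#[1] "123"      "34,56,78" "90"
```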

Exactly right. [Here's a related answer](http://stackoverflow.com/a/15578980/980833) to a similar question about R's `strsplit()`. — Josh O'Brien, May 31 '14 at 17:49
