3

I'm trying to split a string in R using strsplit and a Perl-compatible regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g. "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:

  • wherever a hyphen appears.
  • wherever a period appears.
  • between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g. "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).

For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "A4K7", "01", "A", "13", "B", "J29Q", "10"].
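
The three rules can also be written out literally, which gives a reference to compare any single-regex solution against. This is only a sketch: `split_tokens` is a hypothetical helper name, not part of base R, and it assumes the tokens contain only ASCII digits and capital letters.

```r
# Hypothetical reference implementation of the three rules above.
split_tokens <- function(x) {
  # Rules 1 and 2: split on every period or hyphen.
  tokens <- strsplit(x, "[.-]")[[1]]
  # Rule 3: a token that is exactly two digits plus one capital letter
  # is broken between the digits and the letter.
  pieces <- lapply(tokens, function(tok) {
    if (grepl("^[0-9]{2}[A-Z]$", tok)) {
      c(substr(tok, 1, 2), substr(tok, 3, 3))
    } else {
      tok
    }
  })
  unlist(pieces)
}

split_tokens("WXYZ-AB-A4K7-01A-13B-J29Q-10")
split_tokens("012A.B1A-0A1-01A2")  # none of these tokens match rule 3
```

This takes two passes instead of one regex, but it never depends on how strsplit treats lookbehinds.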

My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.

Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:

#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB"   "A4K7" "01A"  "13B"  "J29Q" "10"

#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13"            "B-J29Q-10"

But when I try to combine them using an alternation, the second regex stops working, and the string is only split on periods and hyphens:

#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB"   "A4K7" "01A"  "13B"  "J29Q" "10"

Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?

Desired output:

## [[1]]
## [1] "WXYZ" "AB"   "A4K7" "01"   "A"    "13"   "B"    "J29Q" "10"
Tyler Rinker
ApproachingDarknessFish
  • For clarification, would `2285C` be split into `2285` and `C`? If not, I need to edit my answer. – Rich Scriven Jan 25 '17 at 22:14
  • @RichScriven No it would not, the tokens should only be split at digit-letter boundaries if they match exactly `\d\d[A-Z]` and are of length 3. – ApproachingDarknessFish Jan 25 '17 at 22:17
  • @RichScriven, just throw in a boundary? `"[-.]|(?<=\\b[0-9]{2})(?=[A-Z]\\b)"` – Jota Jan 25 '17 at 22:20
  • I downvoted as this post is not clear in what you're after. This caused people to waste time helping solve the wrong problem correctly. I'd change this vote if you made it clear. The easiest way to make it clear is to show the desired output. Often showing desired output makes a post 10x clearer. – Tyler Rinker Jan 25 '17 at 22:24
  • @Jota Works perfectly as long as you add another `\\b` to the lookahead. Thanks! – ApproachingDarknessFish Jan 25 '17 at 22:25
  • @TylerRinker Thank you for your feedback. I'll try to put more effort into my examples in the future to make it clearer what the desired behavior is. – ApproachingDarknessFish Jan 25 '17 at 22:25
  • @Jota does that work? I can't check it now, I'm away from my computer for a while. – Rich Scriven Jan 25 '17 at 23:01
  • What about [`strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)`](http://ideone.com/gIXBQi)? – Wiktor Stribiżew Jan 25 '17 at 23:57
  • @ApproachingDarknessFish Not too late to do it now. This makes your post useful to other people searching for similar problems. – Tyler Rinker Jan 26 '17 at 00:15
  • @RichScriven actually, no need for the boundary in the lookahead if you leave in the `[-.]`, as that part is compatible with how `strsplit` works: `"[-.]|(?<=\\b[0-9]{2})(?=[A-Z][-.])"` – Jota Jan 26 '17 at 01:36
  • I've clarified the example. I'll accept an answer that provides a regex with the lookbehind replaced with the boundary class. If no one posts one in the next six hours I'll self-answer. – ApproachingDarknessFish Jan 26 '17 at 01:50
  • @ApproachingDarknessFish I added what I believe is your desired output. Please correct if this is not true. – Tyler Rinker Jan 26 '17 at 17:50

3 Answers

2

An alternative that saves you from having to consider how the strsplit algorithm works is to use your original regex with gsub to insert a simple splitting character in all the right places, then use strsplit to do the straightforward splitting.

x <- "XYZ-02-01C-33D-2285"
strsplit(
    gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
    "-",
    fixed = TRUE)
#[[1]]
#[1] "XYZ"  "02"   "01"   "C"    "33"   "D"    "2285"
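
To see why the two-step version works, it can help to look at the intermediate string that gsub produces. The sketch below uses the string from the question; the variable names `marked` and `tokens` are only for illustration.

```r
x <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"

# Step 1: gsub scans the whole string, so the lookbehind can still see the
# hyphen in front of "01" and "13". Every split point becomes a plain "-".
marked <- gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE)
print(marked)

# Step 2: a fixed-string split needs no lookarounds at all.
tokens <- strsplit(marked, "-", fixed = TRUE)[[1]]
print(tokens)
```

Because the lookahead still requires a separator after the letter, a token such as "01A" at the very end of the string would not be split by this pattern.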

Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.

Jota
1

You may use a consuming version of a positive lookbehind (the match reset operator \K) to make sure strsplit works correctly in R and avoid the problem of using a negative lookbehind inside a positive one.

"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"

See the R demo online (and a regex demo here).

strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
##    [1] "XYZ"  "02"   "01"   "C"    "33"   "D"    "2285"

strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
##    [1] "WXYZ" "AB"   "A4K7" "01"   "A"    "13"   "B"    "J29Q" "10" 

Here, the pattern matches:

  • (?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
    • (?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded with a char other than . and - (i.e. that are preceded with . or - or start of string, it is a common trick to avoid alternation inside a lookaround)
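    • \K - the match reset operator that makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns if any
    • (?=[A-Z](?:[.-]|$)) - a positive lookahead that requires an uppercase ASCII letter followed by ., - or the end of the string, without consuming any of it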
  • | - or
  • [.-] - matches . or -.
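
A quick way to check the pattern against the edge cases listed in the question is to put them all in one string. The input below is made up for illustration: four tokens that should stay whole, followed by one that should be split.

```r
pattern <- "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"

# Made-up input: "012A", "B1A", "0A1" and "01A2" must not be split,
# while the trailing "01A" must become "01" and "A".
x <- "012A-B1A.0A1-01A2-01A"
result <- strsplit(x, pattern, perl = TRUE)[[1]]
print(result)
```

The first four tokens should come back unchanged. The final "01A" should be split even though nothing follows it, thanks to the `$` branch of the lookahead.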
Graham
Wiktor Stribiżew
0

Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not work as expected when the lookbehind overlaps a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:

#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
     ^

#match + left removed
"AB-A4K7-01A-13B-J29Q-10"

#further matches found and removed
"01A-13B-J29Q-10"

#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"

#algorithm continues
"13B-J29Q-10"

This was fixed by replacing the [.-] class to detect the edges of the token in the lookbehind with a boundary anchor, as per Jota's suggestion:

> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB"   "A4K7" "01"   "A"    "13"   "B"    "J29Q" "10"
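
The removal step described above can be imitated in a few lines of R. This is a simplified model, not the actual implementation. `simulate_strsplit` is a hypothetical helper that handles a single string and uses regexpr to find the first match in whatever is left of it. The branch for an empty match at the very start of the remaining string is an assumption about how strsplit behaves, and neither pattern below reaches it.

```r
# Simplified model of strsplit(x, pattern, perl = TRUE) for one string.
simulate_strsplit <- function(x, pattern) {
  tokens <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) break                  # no further match
    end <- start + attr(m, "match.length") - 1L
    if (end >= 1L) {
      # Keep the text before the match, then drop the match and
      # everything to its left. Later lookbehinds cannot see it.
      tokens <- c(tokens, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Assumed handling of an empty match at position 1:
      # emit one character and move on.
      tokens <- c(tokens, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  if (nzchar(x)) tokens <- c(tokens, x)      # whatever is left is the last token
  tokens
}

x <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"
original <- "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]"
boundary <- "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)"

simulate_strsplit(x, original)  # "01A" and "13B" stay whole
simulate_strsplit(x, boundary)  # "01A" and "13B" are split
```

The boundary version works because `\\b` also matches at the start of the remaining string, so it does not need the hyphen that has already been removed.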
ApproachingDarknessFish