Java Pattern.split() with overlapping delimiters

Question

Firstly, I'm aware of similar questions that have been asked such as here:

How to split a string, but also keep the delimiters?

However, I'm having issue implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters, but where they can sometimes appear to overlap. Here is the example:

The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).

For this example, the codewords are:

/ABC/
/DEF/
/GHI/

Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:

((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))

Working string:

"123/ABC//DEF/456/GHI/789"

Using split, this tokenises nicely to:

"123","/ABC/","/DEF/","456","/GHI/","789"

Problem string (note single slash between "ABC" and "DEF"):

"123/ABC/DEF/456/GHI/789"

Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!

Desired outcome is:

"123","/ABC/","DEF/456","/GHI/","789"

Actual outcome is:

"123","/ABC","/","DEF/","456","/GHI/","789"

As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.

I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!

At some point your regex becomes so convoluted that you're better off writing a simple method to walk the `String` and parse it... — Boris the Spider, Dec 15 '16 at 15:06
With the look-ahead + look-behind method, I want the "789" to be separate, as per the "working string" example (I later "re-assemble" the codewords and values from the token list into an associative array). — Julian Mclean, Dec 15 '16 at 15:34
@BoristheSpider - point taken, however the approach of tokenising the string in a "single hit" seemed like a safe and efficient way of extracting the tokens/values. The codeword list is also configurable, so this regex is built dynamically and I am somewhat less interested in how complex it becomes (so long as it performs). — Julian Mclean, Dec 15 '16 at 15:40
Why is that the "desired" outcome? I would say the expected outcome is `"123", "/ABC/", "", "/DEF/", "456", "/GHI/", "789"`, ie there's a blank token between `/ABC/` and `/DEF/`. Why is that not the case? — Bohemian, Dec 15 '16 at 15:49
@Bohemian - that is also an acceptable (arguably better) outcome, as ultimately this gets re-assembled into an associative array. However, this is currently managed in a post-processing that identifies where subsequent tokens are actually known codewords, and deals with it there. If you can suggest a resolution to that problem within the Regex, I would also be interested in that! — Julian Mclean, Dec 15 '16 at 16:09
As a first step, can anyone actually explain the behaviour I am seeing? I get the impression it is cause by the look-ahead/look-behind which enables gathering the tokens in the split. If I don't use look-ahead/look-behind, the values seem to come out ok, but I then don't know know the codeword (delimiter) values, which I need. I guess might be able to negate the regex and double-pass it to separately get the codewords, then splice together, but feels a bit dangerous/hacky. — Julian Mclean, Dec 15 '16 at 16:14
@julian split can't split out a blank without moving the pointer, so extracting a blank between non consumed input is impossible. However, I've posted an answer that gives you your desired outcome. — Bohemian, Dec 15 '16 at 16:17
In the case of the *problem string*, why the third token is `"/DEF/456"` and not `"DEF/456"` ? — Spotted, Dec 16 '16 at 07:28

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

2

If you are OK with find rather than split, using some non-greedy matches, try this:

public class SampleJava {
static final String[] CODEWORDS = {
    "ABC",
    "DEF",
    "GHI"};
static public void main(String[] args) {
    String input = "/ABC/DEF/456/GHI/789";
    String codewords = Arrays.stream(CODEWORDS)
            .collect(Collectors.joining("|", "/(", ")/"));
    //     codewords = "/(ABC|DEF|GHI)/";
    Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
    Matcher m = p.matcher(input);
    while(m.find()) {
        System.out.print(m.group(0));
        if(m.group(1) != null) {
            System.out.print(" ← code word");
        }
        System.out.println();
    }
}
}

Output:

/ABC/ ← code word

DEF/456

/GHI/ ← code word

789

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 15 '16 at 16:13

Patrick Parker

4,863
4
19
51

1

This one looks interesting, and based on quick testing on http://regex-testdrive.com/ it seems to cope with the problem. I'll confess that it is not my actual fingers doing the coding here, so I'll get my devs to experiment with this and report back and accept the answer if it works. – Julian Mclean Dec 15 '16 at 16:20
@JulianMclean an added benefit of this approach is that you don't need to do a separate loop/regex to see if the token is a code word. See my latest edit for details. – Patrick Parker Dec 15 '16 at 17:25

Bohemian · Answer 2 · 2016-12-16T10:12:42.730

1

Use a combination of positive and negative look arounds:

String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");

There's also a considerable simplification by using alternations inside single look ahead/behind.

See live demo.

edited Dec 16 '16 at 10:12

answered Dec 15 '16 at 16:11

Bohemian

412,405
93
575
722

Get the point about the alternations, and agree - as per my comment above, my fingers are not doing the actual coding here so I wanted to post the regex that was actually being used - agree that it can be optimised a bit. The live demo seems to be using the "working string" from my original question, rather than the problem one? Also, it is then not achieving the desired result because "/DEF/456" is coming out as a single token - see my original question examples. Thanks for the help here though, feels like we're getting closer! – Julian Mclean Dec 15 '16 at 16:25
@julian yes, we were close. Now very close. I fine tuned the matching by adding negative look behinds to exclude undesirable matches. Demo link updated too. – Bohemian Dec 16 '16 at 10:15
A quick test in online tester seems to show this is working. My devs had already started working on the solution above from Patrick, but wanted to let others know, as this is potentially the most "direct" answer to my question, which was to get the regex working. Not sure if I can accept 2 answers or not...let me see – Julian Mclean Dec 16 '16 at 12:28
@julian you can't accept 2 answers, but you can change your mind by accepting a different answer (which will automatically un-accept the other answer). As per the ["accept "help page](http://stackoverflow.com/help/someone-answers), you should accept the answer that you believe is *the best solution to your problem*. – Bohemian Dec 16 '16 at 14:19
@JulianMclean not sure if you used this Bohemian split or not, but I played around with it and found it can fail for some values. For example, with "/ABC/DEF/GHI/" it will recognize ABC as a codeword but not GHI. – Patrick Parker Feb 01 '17 at 21:05

Spotted · Answer 3 · 2016-12-16T10:51:14.233

Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:

Write specs (Red)

I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.

import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

import org.junit.Test;

public class TokenizerSpec {

    Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");

    @Test
    public void itShouldTokenizeTwoConsecutiveCodewords() {
        String input = "123/ABC//DEF/456";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
    }

    @Test
    public void itShouldTokenizeMisleadingCodeword() {
        String input = "123/ABC/DEF/456/GHI/789";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
    }

    @Test
    public void itShouldTokenizeWhenValueContainsSlash() {
        String input = "1/23/ABC/456";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
    }

    @Test
    public void itShouldTokenizeWithoutCodewords() {
        String input = "123/456/789";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123/456/789");
    }

    @Test
    public void itShouldTokenizeWhenEndingWithCodeword() {
        String input = "123/ABC/";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/");
    }

    @Test
    public void itShouldTokenizeWhenStartingWithCodeword() {
        String input = "/ABC/123";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("/ABC/", "123");
    }

    @Test
    public void itShouldTokenizeWhenOnlyCodeword() {
        String input = "/ABC//DEF//GHI/";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
    }
}

Implement according to the specs (Green)

This class make all the tests above pass

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public final class Tokenizer {

    private final List<String> codewords;

    public Tokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    public List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();

        int lastIndex = 0;
        int i = 0;
        while (i < input.length()) {
            final int idx = i;
            Optional<String> codeword = codewords.stream()
                                                 .filter(cw -> input.substring(idx).indexOf(cw) == 0)
                                                 .findFirst();
            if (codeword.isPresent()) {
                if (i > lastIndex) {
                    tokens.add(input.substring(lastIndex, i));
                }
                tokens.add(codeword.get());
                i += codeword.get().length();
                lastIndex = i;
            } else {
                i++;
            }
        }

        if (i > lastIndex) {
            tokens.add(input.substring(lastIndex, i));
        }

        return tokens;
    }
}

Improve implementation (Refactor)

Not done at the moment (not enough time that I can spend on that answer now). I'll do some refactor on Tokenizer with pleasure if you request me to (but later). :-) Or you can do it yourself quite securely since you have the unit tests to avoid regressions.

Java Pattern.split() with overlapping delimiters

3 Answers3

Write specs (Red)

Implement according to the specs (Green)

Improve implementation (Refactor)