Is it possible to make a modified pattern so that when a split is applied the separator will be whatever does NOT match the base pattern?

Question

In a recent use of String.split(), I faced a situation where the text was so dynamic that it was easier to pick out the matches than to filter out the non-matches.

I caught myself wondering whether it's possible to build a "reverse regex" for String.split() so that, given any pattern, it matches every group of characters that does NOT match that pattern.

*NOTE: The "problem" here can easily be solved with String.matches(), tokens, Matcher.group(), etc. This question is mostly hypothetical (code samples are still welcome, as the question's nature pretty much requires them); it's not about how to achieve the results, but about whether it's possible to achieve them this way.

What I tried:

    String pattern1 = "(test)"; // Sanity check that the "should-not-match" part works correctly.
    String pattern2 = "[^(test)]"; // FAIL - excludes the individual letters rather than the word.
    String pattern3 = "(^(test))"; // FAIL - does not match anything, it seems.
    String text = ""
            + "This is a test. "
            + "This test should (?not?) match the word \"test\", whenever it appears.\n"
            + "This is about to test if a \"String.split()\" can be used in a different way.\n"
            + "By the way, \"testing\" does not equal \"test\","
            + "but it will split in the middle because it contains \"test\".";
    for (String s : text.split(pattern3)) {
        System.out.println(s);
    }

I also tried other, similar patterns, none of which came anywhere near working.

UPDATE:

I have now tried a few patterns using the special constructs as well, but haven't gotten any of them to work either.

What I want, following the "test" example, is to get an array of strings whose content is "test" (what I want to use as the base pattern, or in other words, what I want to FIND).

But I want to do this using String.split(), where using the base pattern directly yields "whatever is not (test)", so the pattern needs reversing in order to yield "just the occurrences of (test)".

Bible-sized-long-story-short, what I want is a regex for String.split() that results in the behavior (and result) below. NOTE: this follows the example code above, including the needed variables (text).

    String[] trash = text.split("test"); // <- base pattern, needs reversing.
    System.out.println("\n\nWhat should match the split-pattern (due reversal), become separators, and be filtered out:");
    for (String s : trash) {
        System.out.println("[" + s + "]");
        text = text.replace(s, "%!%"); // <- simulated wanted behavior.
    }
    System.out.println("\n\nWhat should be the resulting String[]:");
    for (String s : text.split("%!%")) {
        System.out.println(s);
    }
    System.out.println("Note: There is a blank @ index [0], since if the text does not start with \"test\", there is a sep. between. This is NOT WRONG.");

Code samples are welcome. The possibility (or not) to create such code is this question's nature after all.

score 3 · Answer 1 · answered Jul 19 '12 at 18:02

You may be talking about the (?! construct.

It is documented in the javadoc for the Pattern class. They call it a negative look-ahead assertion.
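
To give a feel for the construct, here is a minimal sketch. The class name, the method name `mask`, and the sample string are made up for illustration; only `(?!...)` itself comes from the Pattern javadoc. It replaces "test" only where it is not immediately followed by "ing".

```java
import java.util.regex.Pattern;

final class NegativeLookaheadDemo {
    // "test" only where it is NOT immediately followed by "ing".
    // The look-ahead checks the following characters without consuming them.
    static final Pattern TEST_NOT_TESTING = Pattern.compile("test(?!ing)");

    // Hypothetical helper: replaces every qualifying "test" with "X".
    static String mask(String input) {
        return TEST_NOT_TESTING.matcher(input).replaceAll("X");
    }

    public static void main(String[] args) {
        // The "test" inside "testing" is left alone; the standalone one is replaced.
        System.out.println(mask("testing a test")); // prints: testing a X
    }
}
```

The look-ahead only vetoes a match at a given position. On its own it does not turn "everything that is not test" into a separator, which is why it is not a complete answer here.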

The most straightforward way to solve your problem is a repeated find.

    Pattern p = Pattern.compile(regexForThingIWant);
    Matcher m = p.matcher(str);
    // find() resumes where the previous match ended, so no manual cursor is needed
    while (m.find()) {
      String x = m.group();
      // do something with x
    }
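
For completeness, a self-contained version of that loop might look like the sketch below. The class name `FindAllDemo`, the helper `findAll`, and the sample text are assumptions for illustration, not anything from the question. It collects every match into a list, which is roughly the "array of test" the question asks for, just without split().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each find() continues after the end of the previous match.
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "This is a test. \"testing\" contains \"test\".";
        System.out.println(findAll("test", text)); // prints: [test, test, test]
    }
}
```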

I was able to kludge up a regexp for a split that seems to do what you want, but badly:

(^|(?<=test))((?!test).)*
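
A rough sketch of how this pattern behaves in split() follows. One assumption is added here that is not part of the pattern above: the `(?s)` flag, because `.` does not match line terminators by default, so without it the pattern stops consuming at the first newline. The class and method names are made up.

```java
import java.util.Arrays;

final class SplitKludgeDemo {
    // The pattern from above, prefixed with (?s) so that '.' also matches '\n'.
    // The flag is an assumption added for this sketch.
    static final String REVERSED = "(?s)(^|(?<=test))((?!test).)*";

    // Hypothetical helper: splits on "everything that is not test".
    static String[] keepOnlyTest(String input) {
        return input.split(REVERSED);
    }

    public static void main(String[] args) {
        String[] parts = keepOnlyTest("a test b\ntest c");
        // A leading empty string appears because the input does not start with "test".
        System.out.println(Arrays.toString(parts)); // prints: [, test, test]
    }
}
```

The separator starts either at the beginning of the input or right after a "test", and then swallows characters one at a time for as long as the next four are not "test".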

Mutant Bob


The regex is not quite right yet. It will just "eat" everything that is after the word "test" on a line. – Jirka Hanika Jul 19 '12 at 18:19
@Mutant_Bob please note that, as stated, the question is NOT about "a way to solve the problem"; it's about doing so with `String.split()`, based solely on "reversing" the matching pattern's behavior. --- I've tried out the `negative look-ahead` construct, but I was not successful. (Q will be edited to include it.) Still, it gave me some insight into how to use it, so it was of some use. – CosmicGiant Jul 19 '12 at 19:02

score 0 · Answer 2 · edited May 23 '17 at 11:48

It is not easy for me to see what output you want from the split, because your only hints are in the test string itself, and they are indirect at that (for example, that you want the word testing to come out in two pieces).

Well, let's try a positive lookbehind:

^|(?<=test)

This returns

This is a test
. This test
 should (?not?) match the word "test
", whenever it appears.
This is about to test
 if a "String.split()" can be used in a different way.
By the way, "test
ing" does not equal "test
",but it will split in the middle because it contains "test
".

Is that what you wanted?

Note that when you split a text in such a way that neither the "matching" nor the "non-matching" bits of the input (in the loose sense) are consumed by the split, you need to engineer the regex so that it only matches (some) empty strings, in the technical sense of the word "match".

Lookaheads and lookbehinds are therefore almost your only tools for solving such tasks with regular expressions.
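
As an illustration of such an empty-string-only separator, here is a small sketch. The combined pattern `(?=test)|(?<=test)`, the class name, and the sample input are assumptions chosen for the example. It cuts both just before and just after every "test", so nothing is consumed and the occurrences end up as array elements of their own.

```java
import java.util.Arrays;

final class ZeroWidthSplitDemo {
    // Matches only empty strings: the position right before "test"
    // and the position right after "test".
    static final String BOUNDARIES = "(?=test)|(?<=test)";

    // Hypothetical helper: splits at the boundaries without consuming any text.
    static String[] cutAroundTest(String input) {
        return input.split(BOUNDARIES);
    }

    public static void main(String[] args) {
        String[] parts = cutAroundTest("a test b");
        System.out.println(Arrays.toString(parts)); // prints: [a , test,  b]
    }
}
```

The non-matching pieces are still in the array, so a caller would have to filter them out afterwards.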

However, if you prefer all non-test parts to be consumed, that is achievable, too.

(?<=^|(test))(tes[^t]|te[^s]|t[^e]|[^t])*

It is the same lookbehind followed by consuming anything that does not look like the word test.
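
A quick sketch of how this could be tried out is below. The class name, method name, and sample input are made up for illustration; the regex is the one given above, unchanged.

```java
import java.util.Arrays;

final class ConsumingSplitDemo {
    // Start of input or right after "test", then any run of text
    // that does not spell out "test".
    static final String NOT_TEST = "(?<=^|(test))(tes[^t]|te[^s]|t[^e]|[^t])*";

    // Hypothetical helper: splits on the consuming pattern.
    static String[] keepOnlyTest(String input) {
        return input.split(NOT_TEST);
    }

    public static void main(String[] args) {
        String[] parts = keepOnlyTest("a test b test c");
        // Leading empty string: the input does not start with "test".
        System.out.println(Arrays.toString(parts)); // prints: [, test, test]
    }
}
```

One case where it misfires, as far as I can tell, is a stray t directly in front of the word, as in "attest": the `t[^e]` alternative pairs that t with the first letter of "test", so the occurrence is swallowed.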

This method is not completely general, though. This question explains the limitations.

@"Lookaheads and lookbehinds are therefore your almost only tools to solve such tasks using regular expressions." --- Yep... and that's what got me wondering if a regex solution is even possible. And I'm starting to think it isn't. "o_0 – CosmicGiant Jul 19 '12 at 19:54
What I want is to "not-match (test)", in a way where "test" can mean any given pattern (it's the example's "base pattern"). The result, in this example, would be an array containing ["test","test","test",...], possibly with a "blank string" at index [0] if the text does not start with "test" (or whatever the "base pattern" is). -- See updated question. – CosmicGiant Jul 19 '12 at 20:29
