How to parse multiple identifiers within a string using Java (Pattern) if the separator character is ambiguous? Return all variants

Question

In a Java application (running JVM version 17), I have a communication protocol where each line has the following structure:

<identifier> <space> <identifier>

The problem is that the identifiers themselves may contain single spaces (in addition to upper- and lowercase Latin letters), so it is unclear which space is the separator. Example:

Let the communication on the wire be:

abc def uvw xyz

Now, the separating space could have three different positions:

First identifier: abc, second identifier: def uvw xyz.
First identifier: abc def, second identifier: uvw xyz.
First identifier: abc def uvw, second identifier: xyz.

In this case, that is technically not a problem: after parsing, each identifier can be checked for validity. (Note that the set of identifier values is both "huge", so you would not want to put it into a regular expression, and partially unknown, but verifiable after the fact.)

[Background for the ambiguous protocol: there is a human being at the other end who, given their role and situation, is not in a position to think about the ambiguity of what they are sending. Moreover, to a human reader the semantics of the identifiers make it obvious where to make the cut.]

The challenge is to create an algorithm that produces all these possible combinations for an arbitrary input.

For brevity, it may be assumed that there is no "prefix/suffix problem" between the identifiers, i.e. the identifiers are cut in such a way that a suffix of the first identifier isn't a prefix of the second identifier.

I already tried to start with a Java Pattern Regular Expression like

([A-Za-z ]+) ([A-Za-z ]+)

but here greediness always returns the "last" variant from above, e.g.

group 1: abc def uvw
group 2: xyz

I also looked at the various regex modifiers, including those not supported by Java (e.g. "Ungreedy"), and experimented with making the quantifier lazy or possessive, but to no avail. I also looked at the Javadoc API and tried .find() and .results(), but apparently backtracking has terminated by then and I cannot restart it.

Due to some additional factors, it would be preferable to have this parsing done using java.util.regex.Pattern, but this is not mandatory.

I don't do much Java, but why can't you just iterate over the location of each space, checking whether the before and after are both valid? (Pseudocode: `while (matcher.find()) { if (is_ident(s.substring(0, matcher.start())) && is_ident(s.substring(matcher.end()))) {...} }`) — rici, Feb 13 '23 at 15:34
Although if the separators are always single space characters, using a regex to find them is overkill. You could just use String.indexOf. — rici, Feb 13 '23 at 15:42
Would definitely be worth an SO answer :) Extension of the question: I also have further cases where I have three (or more) identifiers to parse. Your approach would definitely work for two (so it's a valid answer to the question here), but for three it gets more complicated. — EagleRainbow, Feb 13 '23 at 21:00
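
The loop suggested in the comments above can be extended to three or more identifiers with a small recursion. What follows is only a minimal sketch, not code from the thread: the class name `IdentifierSplitter`, the method `split`, and the `isValid` predicate (a stand-in for the after-the-fact verification described in the question) are all assumptions made for illustration. Spaces are located with `java.util.regex.Pattern`, since the question prefers it, although `String.indexOf` would work equally well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierSplitter {

    private static final Pattern SPACE = Pattern.compile(" ");

    /**
     * Returns every way to cut the input into exactly `parts` identifiers at
     * single spaces, keeping only cuts where each piece passes isValid.
     * isValid is a hypothetical hook for the real identifier check.
     */
    static List<List<String>> split(String input, int parts, Predicate<String> isValid) {
        List<List<String>> results = new ArrayList<>();
        collect(input, parts, isValid, new ArrayList<>(), results);
        return results;
    }

    private static void collect(String rest, int parts, Predicate<String> isValid,
                                List<String> chosen, List<List<String>> results) {
        if (parts == 1) {
            // The last identifier takes everything that is left.
            if (isValid.test(rest)) {
                List<String> done = new ArrayList<>(chosen);
                done.add(rest);
                results.add(done);
            }
            return;
        }
        // Try every space in the remaining text as the next separator.
        Matcher m = SPACE.matcher(rest);
        while (m.find()) {
            String head = rest.substring(0, m.start());
            if (!isValid.test(head)) {
                continue; // prune: this prefix is not an identifier
            }
            chosen.add(head);
            collect(rest.substring(m.end()), parts - 1, isValid, chosen, results);
            chosen.remove(chosen.size() - 1); // undo the choice before trying the next space
        }
    }

    public static void main(String[] args) {
        // Placeholder validator: accept any non-empty piece.
        System.out.println(split("abc def uvw xyz", 3, s -> !s.isEmpty()));
        // prints [[abc, def, uvw xyz], [abc, def uvw, xyz], [abc def, uvw, xyz]]
    }
}
```

With a real validator in place of the placeholder, most branches are pruned early, so in practice the result list should shrink to the one or few readings that actually consist of known identifiers.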

score 0 · Answer 1 · answered Feb 15 '23 at 05:56

Why not String.split(String)?

If you split your input, you can then scan through the words to find which combination of them forms the identifier.

import java.util.Arrays;

String stringin; // value from somewhere
String theWord;  // the starting identifier

String[] words = stringin.split(" ");

for (int width = 1; width < words.length; width++) {
    for (int start = 0; start + width <= words.length; start++) {
        // copyOfRange's end index is exclusive, so this joins exactly `width` words
        String candidate = String.join(" ", Arrays.copyOfRange(words, start, start + width));
        if (candidate.equals(theWord)) // compare contents, not references
            return candidate;
    }
}

See https://stackoverflow.com/questions/513832/how-do-i-compare-strings-in-java and the repeated `Arrays.copyOfRange(...).join(" ")` should be a variable. — Clashsoft, May 21 '23 at 15:02

Reilas · Answer 2 · 2023-05-21T15:04:16.237

There is no need for a Pattern object.

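You can use the following code to derive the list of candidate first identifiers for a given String.

import java.util.ArrayList;
import java.util.List;

// Collects every prefix of the input that ends just before a space,
// i.e. every candidate for the first identifier.
List<String> ids(String string) {
    List<String> ids = new ArrayList<>();
    int indexOf, offset = -1;
    // Jump from one space to the next; offset remembers the last space found.
    while ((indexOf = string.indexOf(' ', offset + 1)) != -1)
        ids.add(string.substring(0, offset = indexOf));
    return ids;
}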

Output

[abc, abc def, abc def uvw]
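
The list above contains only the candidates for the first identifier. To get the variants the question asks for, each prefix can be paired with the text after the same space. A minimal sketch, assuming Java 16 or later for the record syntax; the names `PairSplits`, `Split` and `splits` are made up for this example:

```java
import java.util.ArrayList;
import java.util.List;

final class PairSplits {

    /** One candidate reading of the line: the text before and after the chosen space. */
    record Split(String first, String second) {}

    static List<Split> splits(String line) {
        List<Split> result = new ArrayList<>();
        // Visit every space; each one is a possible separator.
        for (int i = line.indexOf(' '); i != -1; i = line.indexOf(' ', i + 1)) {
            result.add(new Split(line.substring(0, i), line.substring(i + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Split s : splits("abc def uvw xyz")) {
            System.out.println(s.first() + " | " + s.second());
        }
    }
}
```

Each `Split` can then be handed to whatever validity check is available, discarding the pairs in which either side is not a known identifier.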

In response to your inquiry about regular-expression greediness:
A quantifier is either greedy or reluctant, so the first group consumes either as much as possible or as little as possible.

The engine does backtrack internally, but it stops at the first overall match it finds, and Matcher offers no way to resume the search from that point to obtain the remaining variants.
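
The effect can be checked with the pattern from the question. The sketch below (class and method names are illustrative only) runs the greedy pattern and a reluctant variant of it against the sample line; the middle variant, `abc def` / `uvw xyz`, is produced by neither.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {

    /** Returns {group 1, group 2} for a full match, or null if the line does not match. */
    static String[] groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String line = "abc def uvw xyz";
        // Greedy first group: takes as much as it can while still allowing a match.
        String[] greedy = groups("([A-Za-z ]+) ([A-Za-z ]+)", line);
        // Reluctant first group: takes as little as it can.
        String[] reluctant = groups("([A-Za-z ]+?) ([A-Za-z ]+)", line);
        System.out.println(greedy[0] + " | " + greedy[1]);       // abc def uvw | xyz
        System.out.println(reluctant[0] + " | " + reluctant[1]); // abc | def uvw xyz
    }
}
```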

I highly recommend reviewing the source code for Matcher.
If you can't view it from your IDE, you can view it online, at GitHub.

Here is the link.
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/regex/Matcher.java
