In a Java application (running JVM version 17), I have a communication protocol where each line has the following structure:
<identifier> <space> <identifier>
The problem is that the identifiers themselves may contain single spaces (in addition to upper- and lowercase Latin letters), so it is unclear which space is the separator. Example:
Let the communication on the wire be:
abc def uvw xyz
Now, the separating space could have three different positions:
- First identifier: abc, second identifier: def uvw xyz
- First identifier: abc def, second identifier: uvw xyz
- First identifier: abc def uvw, second identifier: xyz
In the given case, this is technically not a problem: after parsing, each identifier can be checked for validity. (Note that the set of valid identifiers is both huge, so you would not want to put it into a regular expression, and partially unknown in advance, although each candidate is verifiable after the fact.)
[Background for the ambiguous protocol: At the other end, a human being is sitting - and based on his/her role and situation, that person isn't able to think about ambiguity of what they are sending. Moreover, if a human mind reads the text, due to semantics and the meaning of the identifiers, it is obvious where to make the cut.]
The challenge is to write an algorithm that generates all of these possible combinations for an arbitrary input.
For brevity, it may be assumed that there is no "prefix/suffix problem" between the identifiers, i.e. the identifiers are cut in such a way that a suffix of the first identifier isn't a prefix of the second identifier.
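To make the expected result concrete: the candidates above are simply the cuts at each space in the line. A minimal sketch that produces them, assuming a leading or trailing space should never yield an empty identifier, could look as follows. The names SplitEnumerator, Split and allSplits are made up for illustration, and the regex engine is used only to locate the spaces, not to backtrack through the alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitEnumerator {
    /** One candidate reading of a line (hypothetical helper type). */
    record Split(String first, String second) {}

    // The separator is a single space; every occurrence is a possible cut.
    private static final Pattern SEPARATOR = Pattern.compile(" ");

    /** Returns one candidate per space in the line, in left-to-right order. */
    static List<Split> allSplits(String line) {
        List<Split> result = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(line);
        while (m.find()) {
            String first = line.substring(0, m.start());
            String second = line.substring(m.end());
            // Assumption: skip cuts that would leave an empty identifier.
            if (!first.isEmpty() && !second.isEmpty()) {
                result.add(new Split(first, second));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Split s : allSplits("abc def uvw xyz")) {
            System.out.println("first: " + s.first() + ", second: " + s.second());
        }
    }
}
```

For the example line this yields the three candidates listed above, each of which would then be passed to the after-the-fact identifier check.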
I already tried to start with a Java Pattern Regular Expression like
([A-Za-z ]+) ([A-Za-z ]+)
but with greedy quantifiers this always returns only the "last" variant from above, e.g.
group 1: abc def uvw
group 2: xyz
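A minimal reproduction of this attempt, assuming the whole line is matched with matches() (the class and method names are placeholders):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyAttempt {
    private static final Pattern LINE = Pattern.compile("([A-Za-z ]+) ([A-Za-z ]+)");

    /** Returns {group 1, group 2} of the single match, or an empty array if there is none. */
    static String[] greedySplit(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return new String[0];
        }
        // The greedy first group keeps as much as it can, so the cut lands on the last space.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] groups = greedySplit("abc def uvw xyz");
        System.out.println("group 1: " + groups[0]); // group 1: abc def uvw
        System.out.println("group 2: " + groups[1]); // group 2: xyz
    }
}
```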
I also looked at the various regex modifiers, including those not supported by Java (e.g. "Ungreedy"). I tried making the quantifier lazy or possessive, but to no avail. I also went through the Javadoc and experimented with .find() and .results(), but apparently backtracking has already terminated by then and I cannot resume it.
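For reference, a sketch of what the lazy and possessive variants do on the example line, again assuming matches() on the whole line (the helper names are placeholders). The lazy first group stops at the first space, and the possessive first group swallows the entire line and never gives the separator back, so neither variant reaches the middle candidate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierVariants {
    /** Returns {group 1, group 2} for a full-line match, or an empty array if the regex does not match. */
    static String[] split(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : new String[0];
    }

    public static void main(String[] args) {
        String line = "abc def uvw xyz";

        // Lazy first group: takes as little as possible, so the cut lands on the first space.
        String[] lazy = split("([A-Za-z ]+?) ([A-Za-z ]+)", line);
        System.out.println(lazy[0] + " | " + lazy[1]); // abc | def uvw xyz

        // Possessive first group: consumes the whole line and does not backtrack, so no match.
        String[] possessive = split("([A-Za-z ]++) ([A-Za-z ]+)", line);
        System.out.println(possessive.length); // 0
    }
}
```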
Due to some additional factors, it would be preferable to have this parsing done using java.util.regex.Pattern, but this is not mandatory.