regex break string into words in dictionary

Question

I want to create a regex that breaks a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like /(HH|HH12)+/ will not match the string HH12HH (link). What's wrong with the regex? Shouldn't it match the first HH12 and then HH in the string?

@EliSadoff I have to keep `HH` and `HH12` because when iterating over the groups I need to know whether it is `HH` or `HH12`. Also, this is just an example. Imagine that you only have a dictionary, and `HH` and `HH12` are in it. The words in the dictionary change as well. — qqibrow, Nov 14 '16 at 21:11
Switch the alternations, it is matching `HH` first and then there is nothing more to match. Or add `$` to the end of the pattern. — Sebastian Proske, Nov 14 '16 at 21:12
@SebastianProske thanks. But even though I add the `$`, there is still one group rather than 2. [link](https://regex101.com/r/6X6GDY/3) — qqibrow, Nov 14 '16 at 21:16
Let me clarify: do you want to make sure the string consists of `HH12` or `HH` only, and if so, tokenize it into `HH` or `HH12`? Or do you only want to get consecutive `HH`/`HH12`? — Wiktor Stribiżew, Nov 14 '16 at 21:47

Thomas Ayoub · Answer 1 · 2016-11-14T21:18:36.927

1

In the string HH12HH, the regex (HH|HH12)+ works this way:

HH12HH
^ - both alternatives can start here, continue
HH12HH
 ^ - the first alternative is entirely satisfied, mark it as a match
HH12HH
  ^ - no match
HH12HH
   ^ - no match

Because you set the A flag, which anchors the pattern at the start of the string, the rest will not produce a match. If you remove it, the pattern will match both the HH at the start and the HH at the end.
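
The same behavior can be reproduced in Java, the language mentioned in the comments. This is only an illustrative sketch: the class name `FirstAlternativeDemo` and both helpers are made up for the example. `lookingAt()` stands in for the anchored case, and `find()` for the pattern without the A flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAlternativeDemo {
    // Hypothetical helper: collect every match that find() reports, scanning forward.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Hypothetical helper: the text matched when the pattern is anchored at the start only.
    static String prefixMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Anchored at the start: the first alternative HH wins, then the repetition stops at "12".
        System.out.println(prefixMatch("(HH|HH12)+", "HH12HH")); // HH
        // Unanchored: the engine skips "12" and finds the trailing HH as a second match.
        System.out.println(findAll("(HH|HH12)+", "HH12HH"));     // [HH, HH]
    }
}
```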

In this case, you have three options:

Put the longest pattern first: /(HH12|HH)/Ag. See demo. (The one I prefer.)
Factor out the shared part and use an optional group: /(HH(?:12)?)/Ag. See second demo
Put a $ at the end, like so: /(HH|HH12)$/Ag
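
For comparison, here is a rough Java sketch of the first two options. It assumes that a leading `\G` is a reasonable stand-in for the A and g flags used above (each match must start where the previous one ended); the class name `AnchoredOptionsDemo` and the helper `groups` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredOptionsDemo {
    // Hypothetical helper: collect capture group 1 of every consecutive match.
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Original order: HH wins, and nothing can match at "12HH" afterwards.
        System.out.println(groups("\\G(HH|HH12)", "HH12HH"));   // [HH]
        // Option 1: longest alternative first.
        System.out.println(groups("\\G(HH12|HH)", "HH12HH"));   // [HH12, HH]
        // Option 2: shared prefix with an optional suffix.
        System.out.println(groups("\\G(HH(?:12)?)", "HH12HH")); // [HH12, HH]
    }
}
```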

edited Nov 14 '16 at 21:18

answered Nov 14 '16 at 21:12

then how to match the entire string? – qqibrow Nov 14 '16 at 21:15
thanks! but there is only one group match. what if I want all the groups. e.g, first group should match `HH12` second group `HH` – qqibrow Nov 14 '16 at 21:18
@qqibrow which language is it? Also, see [Repeating a Capturing Group vs. Capturing a Repeated Group](http://www.regular-expressions.info/captureall.html) – Thomas Ayoub Nov 14 '16 at 21:19
java. is there a way? – qqibrow Nov 14 '16 at 21:20
@qqibrow one easy solution is to use a global capturing group: `((?:HH12|HH)+)` – Thomas Ayoub Nov 14 '16 at 21:21
but still, there is only one matching group. – qqibrow Nov 14 '16 at 21:42
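
Why only one group shows up can be seen with a small Java sketch. A capturing group that sits inside a repetition keeps only its last iteration, and wrapping the repetition in a group captures the whole run as a single string. The class name `RepeatedGroupDemo` and the helper `groupOne` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Hypothetical helper: group 1 after a full-string match, or null if there is no match.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Repeating a capturing group: only the last iteration is kept.
        System.out.println(groupOne("(HH12|HH)+", "HH12HH"));     // HH
        // Capturing a repeated group: one group holding the whole run.
        System.out.println(groupOne("((?:HH12|HH)+)", "HH12HH")); // HH12HH
    }
}
```

Either way there is a single group, so getting the individual tokens requires iterating over separate matches.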

Wiktor Stribiżew · Accepted Answer · 2016-11-14T22:35:53.050

You want to match an entire string in Java that should only contain HH12 or HH substrings. It is much easier to do in 2 steps: 1) check if the string meets the requirements (here, with matches("(?:HH12|HH)+")), 2) extract all tokens (here, with HH12|HH or HH(?:12)?, since the first alternative in an unanchored alternation group "wins" and the rest are not considered).

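import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "HH12HH";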
Pattern p = Pattern.compile("HH12|HH");
List<String> res = new ArrayList<>();
if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
    Matcher m = p.matcher(str);
    while (m.find()) {
        res.add(m.group());
    }
}
System.out.println(res); // => [HH12, HH]

See the Java demo

An alternative is a regex that will check if a string meets the requirements with a lookahead at the beginning, and then will match consecutive tokens with a \G operator:

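import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "HH12HH";
Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
List<String> res = new ArrayList<>();
Matcher m = p.matcher(str);
while (m.find()) {
    res.add(m.group());
}
System.out.println(res); // => [HH12, HH]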

See another Java demo

Details:

(\\G(?!^)|^(?=(?:HH12|HH)+$)) - the end of the previous successful match (\\G(?!^)) or (|) the start of the string (^), provided it is followed by 1+ sequences of HH12 or HH ((?:HH12|HH)+) up to the end of the string ($)
(?:HH12|HH) - either HH12 or HH.

Great! What do you think of this: http://stackoverflow.com/a/16817458/1646996? It also uses `\\G` but doesn't look as complicated. — qqibrow, Nov 15 '16 at 06:41
It depends on what you need, which you have not specified. I explained in my answer: if you need to make sure the string *only* consists of the tokens meeting your pattern, you must first qualify the string for tokenization. That is why I suggested a more complex pattern with a lookahead and splitting the `\G` into `^(?=...)` and `\G(?!^)`. `\G(?:HH12|HH)` will match multiple tokens from the start of the string only, and if there is other text after them in the string, the matches will still be collected. — Wiktor Stribiżew, Nov 15 '16 at 07:31
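
The difference described in the last comment can be checked with a short Java sketch. The input `HH12HHxyz`, the class name `QualifiedTokenizerDemo`, and the helper `tokens` are made up for illustration; the two patterns are the ones discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QualifiedTokenizerDemo {
    // Plain \G: consecutive tokens from the start, regardless of what follows them.
    static final String SIMPLE = "\\G(?:HH12|HH)";
    // Pattern from the answer: the lookahead first checks that the whole string is tokens only.
    static final String STRICT = "(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)";

    // Hypothetical helper: collect every match reported by find().
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(SIMPLE, "HH12HHxyz")); // [HH12, HH]
        System.out.println(tokens(STRICT, "HH12HHxyz")); // []
        System.out.println(tokens(STRICT, "HH12HH"));    // [HH12, HH]
    }
}
```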

score 0 · Answer 3 · edited May 23 '17 at 11:59

The problem you are having is entirely related to the way the regex engine decides what to match.

As I explained here, there are some regex flavors that pick the longest alternation... but you're not using one. Java's regex engine is the other type: the first matching alternation is used.

Your regex works a lot like this code:

if(bool1){
    // This is where `HH` matches
} else if (bool1 && bool2){
    // This is where `HH12` would match, but this code will never execute
}

The best way to fix this is to order your words so that longer words come before their prefixes, i.e. so that HH12 occurs before HH.

Then, you can just match with an alternation:

HH12|HH

It should be pretty obvious what matches, since you can get the results of each match.

(You could also put each word in its own capture group, but that's a bit harder to work with.)
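
Since the dictionary in the question changes over time, the alternation could be generated rather than written by hand. The following is a minimal sketch under that assumption: the class `DictionaryTokenizer` and its methods are hypothetical, sorting by descending length is one way to guarantee that a word comes before any of its prefixes, and `Pattern.quote` is used so that dictionary words are treated as literal text.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryTokenizer {
    // Build an alternation with longer words first, so HH12 is tried before HH.
    static Pattern buildPattern(Collection<String> words) {
        String alternation = words.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote) // treat each word as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Collect every dictionary word found in the input, left to right.
    static List<String> tokenize(Collection<String> words, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = buildPattern(words).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(List.of("HH", "HH12"), "HH12HH")); // [HH12, HH]
    }
}
```

Note that this picks the longest word at each position and never reconsiders, so for some dictionaries it may miss a split that a different choice of words would allow.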
