I want to create a regex to break a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like /(HH|HH12)+/ will not match the string HH12HH (link). What's wrong with the regex? Shouldn't it match the first HH12 and then HH in the string?

Thomas Ayoub
qqibrow
  • What about `/(HH(?:12)?)+/` – Eli Sadoff Nov 14 '16 at 21:08
  • @EliSadoff I have to keep `HH` and `HH12` because when iterating over the groups I need to know whether each one is `HH` or `HH12`. Also, this is just an example. Imagine that you only have a dictionary, and `HH` and `HH12` are in it. The words in the dictionary change as well. – qqibrow Nov 14 '16 at 21:11
  • Switch the alternations, it is matching `HH` first and then there is nothing more to match. Or add `$` to the end of the pattern. – Sebastian Proske Nov 14 '16 at 21:12
  • @SebastianProske Thanks, but even though I added the `$`, there is still one group rather than two. [link](https://regex101.com/r/6X6GDY/3) – qqibrow Nov 14 '16 at 21:16
  • Let me clarify: do you want to make sure the string consists of `HH12` or `HH` only and, if so, tokenize it into `HH` or `HH12`? Or do you only want to get consecutive `HH`/`HH12`? – Wiktor Stribiżew Nov 14 '16 at 21:47

3 Answers

In the string HH12HH, the regex (HH|HH12)+ works this way:

HH12HH
^ - both options can start here, continue
HH12HH
 ^ - the first alternative (HH) is entirely satisfied, mark it as a match
HH12HH
  ^ - no match
HH12HH
   ^ - no match

Because you set the A flag, which anchors the pattern to the start of the string, the rest of the string cannot produce a match. If you remove it, the pattern matches HH both at the start and at the end.
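To see the same behavior in Java, here is a minimal sketch. Java has no A flag, so this corresponds to the unanchored case; the class name `FirstAlternativeDemo` and the helper `findAll` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAlternativeDemo {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // HH is tried first and succeeds at index 0, so HH12 is never used there;
        // "12" is skipped and the trailing HH is found as a separate match.
        System.out.println(findAll("(HH|HH12)+", "HH12HH")); // [HH, HH]
        // With the longer word first, HH12 is consumed and HH follows directly,
        // so the whole string is one match.
        System.out.println(findAll("(HH12|HH)+", "HH12HH")); // [HH12HH]
    }
}
```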

In this case, you have three options:

  • Put the longest pattern first: /(HH12|HH)/Ag. See demo. This is the one I prefer.
  • Factor out the shared part and use an optional group: /(HH(?:12)?)/Ag. See second demo.
  • Put a $ at the end, like so: /(HH|HH12)$/Ag
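As for the follow-up in the comments about getting one group rather than two: a quantified capture group keeps only its last iteration. The sketch below assumes the `+` from the question is kept and uses Java's `matches()`; `RepeatedGroupDemo` and `lastCapture` are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Hypothetical helper: returns what capture group 1 holds after a
    // whole-string match, or null if the string does not match at all.
    static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The $ forces backtracking into HH12, so the whole string matches,
        // but group 1 only remembers the final iteration of the + loop.
        System.out.println(lastCapture("(HH|HH12)+$", "HH12HH")); // HH
        System.out.println(lastCapture("(HH12|HH)+", "HHHH12"));  // HH12
    }
}
```

So to get every token, iterate over successive matches of the unquantified alternation rather than over the groups of a single match.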
Thomas Ayoub

You want to match an entire string in Java that should contain only HH12 or HH substrings. It is much easier to do this in 2 steps: 1) check that the string meets the requirements (here, with matches("(?:HH12|HH)+")), 2) extract all tokens (here, with HH12|HH or HH(?:12)?; the longer word must come first because the first alternative that matches in an unanchored alternation group "wins" and the rest are not considered).

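import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "HH12HH";
Pattern p = Pattern.compile("HH12|HH"); // longer word first
List<String> res = new ArrayList<>();
if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
    Matcher m = p.matcher(str);
    while (m.find()) { // Collect each token in order
        res.add(m.group());
    }
}
System.out.println(res); // => [HH12, HH]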

See the Java demo

An alternative is a single regex that checks whether the string meets the requirements with a lookahead at the beginning, and then matches consecutive tokens with the \G anchor:

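import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "HH12HH";
// Validate the whole string once at the start, then continue from each previous match
Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
List<String> res = new ArrayList<>();
Matcher m = p.matcher(str);
while (m.find()) {
    res.add(m.group());
}
System.out.println(res); // => [HH12, HH]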

See another Java demo

Details:

  • (\\G(?!^)|^(?=(?:HH12|HH)+$)) - the end of the previous successful match (\\G(?!^)) or (|) the start of the string (^) that is followed by 1+ sequences of HH12 or HH ((?:HH12|HH)+) up to the end of the string ($)
  • (?:HH12|HH) - either HH12 or HH.
Wiktor Stribiżew
  • Great! What do you think of this http://stackoverflow.com/a/16817458/1646996? It also uses `\\G` but doesn't look this complicated. – qqibrow Nov 15 '16 at 06:41
  • It depends on what you need, which you have not specified. I explained in my answer: if you need to make sure the string *only* consists of the tokens meeting your pattern, you must first qualify the string for tokenization. That is why I suggested a more complex pattern with a lookahead, splitting the `\G` into `^(?=...)` and `\G(?!^)`. `\G(?:HH12|HH)` will match multiple tokens from the start of the string only, and if there is other text after them in the string, the matches will still be collected. – Wiktor Stribiżew Nov 15 '16 at 07:31
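The difference described in the last comment can be sketched as follows. The input `HH12HHxyz` is an invented example and `GuardedTokenizer`, `PLAIN`, `GUARDED`, and `tokens` are illustrative names, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GuardedTokenizer {
    // Plain \G: consecutive tokens from the start, no check of what follows.
    static final Pattern PLAIN = Pattern.compile("\\G(?:HH12|HH)");
    // Guarded: the first match also requires the whole string to be tokens only.
    static final Pattern GUARDED =
        Pattern.compile("(?:\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");

    // Hypothetical helper: collects successive matches of the pattern.
    static List<String> tokens(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(PLAIN, "HH12HHxyz"));   // [HH12, HH]
        System.out.println(tokens(GUARDED, "HH12HHxyz")); // []
        System.out.println(tokens(GUARDED, "HH12HH"));    // [HH12, HH]
    }
}
```

Which variant fits depends on whether trailing text should invalidate the whole string or merely stop the tokenization.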

The problem you are having is entirely related to the way the regex engine decides what to match.

As I explained here, there are some regex flavors that pick the longest alternation... but you're not using one. Java's regex engine is the other type: the first matching alternation is used.

Your regex works a lot like this code:

if(bool1){
    // This is where `HH` matches
} else if (bool1 && bool2){
    // This is where `HH12` would match, but this code will never execute
}

The best way to fix this is to order your words so that longer words come before their prefixes, i.e. so that HH12 occurs before HH.

Then, you can just match with an alternation:

HH12|HH
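Since the dictionary in the question changes over time, the alternation can be generated rather than written by hand. This is only a sketch under assumptions: the dictionary is a non-empty collection of literal words, and sorting by descending length is used as one simple way to guarantee that every word precedes its prefixes. `DictionaryPattern`, `build`, and `tokens` are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DictionaryPattern {
    // Builds word1|word2|... with longer words first, each quoted as a literal.
    // Assumes a non-empty dictionary (an empty one would match the empty string).
    static Pattern build(Collection<String> words) {
        String alternation = words.stream()
            .sorted((a, b) -> Integer.compare(b.length(), a.length()))
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Collects successive matches of the pattern in the input.
    static List<String> tokens(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = build(Arrays.asList("HH", "HH12"));
        System.out.println(tokens(p, "HH12HH")); // [HH12, HH]
    }
}
```

`Pattern.quote` keeps dictionary words that contain regex metacharacters from being interpreted as syntax.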

It should be pretty obvious what matches, since you can get the results of each match.

(You could also put each word in its own capture group, but that's a bit harder to work with.)
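For completeness, a rough sketch of the capture-group-per-word variant: each alternative gets its own group, and the group that is non-null tells you which word matched. `GroupPerWord` and `matchedGroups` are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPerWord {
    // Group 1 is HH12, group 2 is HH; the longer word still comes first.
    static final Pattern WORDS = Pattern.compile("(HH12)|(HH)");

    // Returns, for each match, the number of the group that participated.
    static List<Integer> matchedGroups(String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = WORDS.matcher(input);
        while (m.find()) {
            // A group that did not take part in the match returns null.
            out.add(m.group(1) != null ? 1 : 2);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matchedGroups("HH12HH")); // [1, 2]
    }
}
```

With a large or changing dictionary, the group numbers have to be kept in sync with the word list, which is the extra work the answer alludes to.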

Community
Laurel