I want to create a regex that breaks a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like `/(HH|HH12)+/` will not match the string `HH12HH`. What's wrong with the regex? Shouldn't it match the first `HH12` and then `HH` in the string?

- What about `/(HH(?:12)?)+/` – Eli Sadoff Nov 14 '16 at 21:08
- @EliSadoff I have to keep `HH` and `HH12` separate because, when iterating over the groups, I need to know whether each one is `HH` or `HH12`. Also, this is just an example. Imagine that you only have a dictionary, and `HH` and `HH12` are in it. The words in the dictionary change as well. – qqibrow Nov 14 '16 at 21:11
- Switch the alternatives: it matches `HH` first, and then there is nothing more to match. Or add `$` to the end of the pattern. – Sebastian Proske Nov 14 '16 at 21:12
- @SebastianProske Thanks, but even though I add the `$`, there is still one group rather than two. [link](https://regex101.com/r/6X6GDY/3) – qqibrow Nov 14 '16 at 21:16
- Let me clarify: do you want to make sure the string consists of `HH12` or `HH` only and, if so, tokenize it into `HH` or `HH12`? Or do you only want to get consecutive `HH`/`HH12`? – Wiktor Stribiżew Nov 14 '16 at 21:47
3 Answers
In the string `HH12HH`, the regex `(HH|HH12)+` works this way:

    HH12HH
    ^ - both alternatives are possible, continue
    HH12HH
     ^ - the first alternative is entirely satisfied, mark it as a match
    HH12HH
      ^ - no match
    HH12HH
       ^ - no match

As you set the `A` flag, which anchors the pattern to the start of the string, the rest will not produce a match. If you remove it, the pattern will match `HH` both at the start and at the end.
In this case, you have three options:
- Put the longest pattern first: `/(HH12|HH)/Ag`. See demo. This is the one I prefer.
- Factor out the shared part and use an optional group: `/(HH(?:12)?)/Ag`. See second demo.
- Put a `$` at the end, like so: `/(HH|HH12)$/Ag`
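For readers working in Java, the effect of the alternation order can be checked with a minimal sketch like the one below. It assumes `Matcher.lookingAt()` as a rough stand-in for the `A` flag (both require the match to begin at the start of the input), and it restores the `+` quantifier from the question. The class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {
    // Returns the text matched at the start of the input, or null if nothing matches there.
    public static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    // Returns what capture group 1 holds after a match anchored at the start, or null.
    public static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "HH12HH";
        // Prefix first: HH wins at position 0, and nothing can match at "12HH".
        System.out.println(matchAtStart("(HH|HH12)+", input));   // HH
        // Longest first: HH12 is tried before HH, so the whole string is consumed.
        System.out.println(matchAtStart("(HH12|HH)+", input));   // HH12HH
        // Shared prefix with an optional suffix behaves the same way.
        System.out.println(matchAtStart("(HH(?:12)?)+", input)); // HH12HH
        // A repeated capturing group keeps only its last iteration.
        System.out.println(lastCapture("(HH12|HH)+", input));    // HH
    }
}
```

The last line shows why a single repeated group cannot return every token: `group(1)` holds only the final iteration, so the tokens have to be collected one match at a time.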

- Thanks! But there is only one group match. What if I want all the groups? E.g., the first group should match `HH12` and the second group `HH`. – qqibrow Nov 14 '16 at 21:18
- @qqibrow Which language is it? Also, see [Repeating a Capturing Group vs. Capturing a Repeated Group](http://www.regular-expressions.info/captureall.html) – Thomas Ayoub Nov 14 '16 at 21:19
- @qqibrow One easy solution is to use a global capturing group: `((?:HH12|HH)+)` – Thomas Ayoub Nov 14 '16 at 21:21
-
You want to match an entire string in Java that should only contain `HH12` or `HH` substrings. It is much easier to do in 2 steps: 1) check if the string meets the requirements (here, with `matches("(?:HH12|HH)+")`), 2) extract all tokens (here, with `HH12|HH` or `HH(?:12)?`, since the first alternative in an unanchored alternation group "wins" and the rest are not considered).
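    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String str = "HH12HH";
    Pattern p = Pattern.compile("HH12|HH"); // longest alternative first
    List<String> res = new ArrayList<>();
    if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
        Matcher m = p.matcher(str);
        while (m.find()) { // Collect each token in order
            res.add(m.group());
        }
    }
    System.out.println(res); // => [HH12, HH]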
See the Java demo
An alternative is a regex that checks whether the string meets the requirements with a lookahead at the beginning, and then matches consecutive tokens with the `\G` operator:

    String str = "HH12HH";
    Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
    List<String> res = new ArrayList<>();
    Matcher m = p.matcher(str);
    while (m.find()) {
        res.add(m.group());
    }
    System.out.println(res); // => [HH12, HH]
Details:
- `(\\G(?!^)|^(?=(?:HH12|HH)+$))` - the end of the previous successful match (`\\G(?!^)`) or (`|`) the start of the string (`^`) that is followed by 1+ sequences of `HH12` or `HH` (`(?:HH12|HH)+`) up to the end of the string (`$`)
- `(?:HH12|HH)` - either `HH12` or `HH`.
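To see what the lookahead adds, here is a small sketch comparing the pattern above with a plain `\G(?:HH12|HH)` on an input that has trailing text. The class name, constant names, and the sample input `HH12HHxx` are made up for this illustration, and the only change to the pattern is a non-capturing outer group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContiguousTokens {
    // Matches tokens from the start of the input, with no check on what follows them.
    public static final String PREFIX_ONLY = "\\G(?:HH12|HH)";
    // Matches tokens only when the whole input consists of tokens.
    public static final String WHOLE_STRING = "(?:\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)";

    // Collects every successive match of the regex in the input.
    public static List<String> findAll(String regex, String input) {
        List<String> res = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            res.add(m.group());
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(findAll(PREFIX_ONLY, "HH12HHxx"));  // [HH12, HH]
        System.out.println(findAll(WHOLE_STRING, "HH12HHxx")); // []
        System.out.println(findAll(WHOLE_STRING, "HH12HH"));   // [HH12, HH]
    }
}
```

With the plain `\G` pattern, the leading tokens are still collected even though `xx` follows them. The lookahead version rejects the string up front, so no tokens are returned.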

- Great! What do you think of this: http://stackoverflow.com/a/16817458/1646996? It also uses `\\G` but doesn't look this complicated. – qqibrow Nov 15 '16 at 06:41
- It depends on what you need, which you have not specified. I explained in my answer: if you need to make sure the string *only* consists of the tokens meeting your pattern, you must first qualify the string for tokenization. That is why I suggested a more complex pattern with a lookahead, splitting the `\G` into `^(?=...)` and `\G(?!^)`. `\G(?:HH12|HH)` will match multiple tokens from the start of the string only, and if there is other text after them in the string, the matches will still be collected. – Wiktor Stribiżew Nov 15 '16 at 07:31
The problem you are having is entirely related to the way the regex engine decides what to match.
As I explained here, there are some regex flavors that pick the longest alternative... but you're not using one. Java's regex engine is the other type: the first matching alternative is used.
Your regex works a lot like this code:
    if (bool1) {
        // This is where `HH` matches
    } else if (bool1 && bool2) {
        // This is where `HH12` would match, but this code will never execute
    }
The best way to fix this is to order your words in reverse, so that `HH12` occurs before `HH`. (Reverse alphabetical order works, because it puts every word before any of its prefixes.)
Then, you can just match with an alternation:

    HH12|HH
It should be pretty obvious what matches, since you can get the results of each match.
(You could also put each word in its own capture group, but that's a bit harder to work with.)
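Since the dictionary in the question changes over time, the ordering can be done in code. The following is a minimal sketch, not a definitive implementation: `DictionaryTokenizer`, `buildPattern`, and `tokenize` are hypothetical names, the dictionary is assumed to be non-empty, and sorting by length (longest first) is used here as one simple way to guarantee that a word comes before its prefixes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DictionaryTokenizer {
    // Builds an alternation with longer words first, so a word is tried before its prefixes.
    public static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote) // treat dictionary words as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the tokens if the whole input consists of dictionary words, else an empty list.
    public static List<String> tokenize(List<String> words, String input) {
        Pattern token = buildPattern(words);
        List<String> res = new ArrayList<>();
        if (!Pattern.matches("(?:" + token.pattern() + ")+", input)) {
            return res;
        }
        Matcher m = token.matcher(input);
        while (m.find()) {
            res.add(m.group());
        }
        return res;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("HH", "HH12");
        System.out.println(tokenize(dictionary, "HH12HH")); // [HH12, HH]
        System.out.println(tokenize(dictionary, "HH12X"));  // []
    }
}
```

Because each `find()` takes the first alternative that matches, this sketch only suits dictionaries where picking the longest word at every position yields a valid segmentation. Other dictionaries would need a different approach.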