2

Can anyone explain to me why my regular expression:

[a-zA-Z]+\s(.+?)+\s(?:pr.|g.|a.)|[A-Z][a-zA-Z]+\.\s[a-zA-Z+\s(?:pr.|g.|a.)|[a-zA-Z]+\s(?:pr.|g.|a.)|[a-zA-Z]+\s[a-zA-Z]+\s(?:pr.|g.|a.)

With input:

Testės r. sav., Žtestčių Vhest sen., Platakių k.

is causing catastrophic Backtracking?

On the other hand, if I add ^ and $, like:

^[a-zA-Z]+\s(.+?)+\s(?:pr.|g.|a.)|[A-Z][a-zA-Z]+\.\s[a-zA-Z+\s(?:pr.|g.|a.)|[a-zA-Z]+\s(?:pr.|g.|a.)|[a-zA-Z]+\s[a-zA-Z]+\s(?:pr.|g.|a.)$

There is no backtracking.

(tested with regex101.com)

However, when I try to use regular expression (^$ included) with Java:

Matcher matcher = myRegexPattern.matcher(string);
if(matcher.find()){
    //
}

Backtrack persists... Any ideas how to avoid backtrack using java?

CrazySabbath
  • 1,274
  • 3
  • 11
  • 33
  • have you escaped special characters when introducing `String` in Java? i.e. `\\\` – Jordi Castilla May 09 '16 at 08:38
  • 1
    What is your expected output? – AKS May 09 '16 at 08:38
  • The main idea: write a *linear* regex. Note that all the branches in your alternation pattern can match at the same location - that increases redundant backtracking. Also, you seem to have a missing `]` at `[a-zA-Z+\s(?:pr.|g.|a.)` and the `+` at `(.+?)+` looks like a real culprit. – Wiktor Stribiżew May 09 '16 at 08:41
  • This regular expression should return nothing, matcher.find() should return false. I escaped every `[` in regular expression, and it works...any idea why? I mean string doesn't even have `[` character. – CrazySabbath May 09 '16 at 08:43
  • As for missing `]`, that's copy/paste error. – CrazySabbath May 09 '16 at 08:49
  • 2
    Side note on code quality: if you think your regex gives you a hard time, right now, while you are in the midst of developing it ... what exactly do you expect to happen when you have to slightly change it in a few weeks, because of a subtle change in your requirements? In other words: maybe, sometimes, there are better ways to parse strings than one huge monster regex?! – GhostCat May 09 '16 at 08:51
  • @Jägermeister I wish there was a better way to parse it, unfortunately there is not (or I can't figure out a better way at least) – CrazySabbath May 09 '16 at 08:56
  • The way you have the regex now, `[a-zA-Z]+\s(.+?)\s(?:(?:pr|g|a)\.)` should work for you. – Wiktor Stribiżew May 09 '16 at 09:06
  • Have you tried the expression I suggested? Does it fix your issue? – Wiktor Stribiżew May 09 '16 at 09:55
  • Unfortunately, no, it does not fix my issue. While it fixes backtrack issue, it doesn't give correct output. Example: `Testės r. sav., Žtestčių Vhest sen., Platakių g.` should return `Platakių g.`, instead your regex returns `r. sav., Žtestčių Vhest sen., Platakių g.` I solved this problem based on this: http://stackoverflow.com/questions/910740/cancelling-a-long-running-regex-match – CrazySabbath May 09 '16 at 10:37
  • That is not a fix, that is a workaround. If you are interested in a fix, please add the specifications to the question in order to write a correct and efficient pattern. [Your regex will match `s r. sav., Žtestčių Vhest sen., Platakių g.`](https://regex101.com/r/jM0xX3/1), not `Platakių g.` – Wiktor Stribiżew May 09 '16 at 13:17
  • 1
    The current regex you have can be just reduced to `[a-zA-Z]+\s(?:(.+?)\s)?(?:(?:pr|g|a)\.)|[A-Z][a-zA-Z]+\.\s[a-zA-Z]+\s(?:(?:pr|g|a)\.)` and it does not make much sense. – Wiktor Stribiżew May 09 '16 at 13:43

0 Answers0