1

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.

The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like: (.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)

The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!

In my code, I have tried a couple of ways to match regex:

  1. if ( (regex findFirstIn log).nonEmpty ) { do something }

  2. val allGroups = regex.findAllIn(log).matchData.toList if (allGroups.nonEmpty) { do something }

  3. if (regex.pattern.matcher(log).matches()){do something}

All three suffer from poor performance when the regex mentioned above it added to the list of regex. Any suggestions to improve regex performance or change the regex itself?

The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.

I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

Community
  • 1
  • 1
Joe Nate
  • 159
  • 10
  • Why do you repeat #011(.+?) multiple times ? – JFPicard May 15 '15 at 18:23
  • AFAIK the regex is slow not because of the lazy/reluctant quantifiers, but because the underlying backtracking implementation is inherently slow. The real long-term fix is to change Java's engine to an NFA automaton matcher. In the meantime, for you as a workaround to the regex woes, how about using String.split() on "#011" and doing processing based on that? – Nayuki May 15 '15 at 19:35
  • For reference: https://swtch.com/~rsc/regexp/regexp1.html (exponential vs. polynomial time for regexes with many quantifiers) – Nayuki May 15 '15 at 19:37
  • @NayukiMinase Can't use split() because the code isn't specific to a specific log/string. The code/app takes a list of regex and for every log line fed to the app, tries to match a regex from the list. One approach I am considering is writing a regex function is C and calling it from Scala. – Joe Nate May 15 '15 at 19:46
  • 1
    I reopened this question because I don't think non-greediness is necessarily the problem. You say you're applying the regex to one line at a time; why aren't you using anchors (`^` and `$`)? Also, are there any other `#` symbols besides the ones in the flattened TABs? Maybe you could use `([^#]++)` instead of `(.+?)`. – Alan Moore May 15 '15 at 19:53
  • Do you really need all the match groups? If not, maybe this regex will speed things up a bit: `(?:(?:(?!#011).)++#011){34}(?:.+?)` – Bart Kiers May 15 '15 at 19:58
  • @JoeNate Why the do you think a function in C will be faster? It will be faster if you get a better regex engine and it will be slower if you get a worse one. But it'll be always much more complicated and also *much slower* than using a proper regex (which needs just a few dozen cycles per char). – maaartinus May 15 '15 at 20:02
  • @AlanMoore I'd also reopen this question as it's much more specific and the linked general answer is rather non-trivial. Here, the laziness causes catastrophic backtracking, but it can happen without laziness, too. – maaartinus May 15 '15 at 20:08

1 Answers1

2

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group

(?>(.+?)#011)

Once matched, there's no more backtracking. Done and looking forward for the next group.

Negating a string

The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:

 ((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011

Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).

An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:

 ((?: (?!#011). )+)#011

It's all pretty complicated and most probably less performant than simply using the atomic group.

Optimizations

Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's a room for improvements (factor 1.8)

( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011

It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.

A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.

Community
  • 1
  • 1
maaartinus
  • 44,714
  • 32
  • 161
  • 320
  • The regex for atomic group doesn't seem to work. Try to match/extract this string "(empty)#011". – Joe Nate May 15 '15 at 19:44
  • @JoeNate Thanks, fixed. The problem was '<' (lookbehind-like nonsense) instead of `>'. – maaartinus May 15 '15 at 19:53
  • You are a life saver, thanks a bunch :) I was trying "^". Now the performance landmine is gone! :) Post a "gift a beer" link, if you have one :D – Joe Nate May 15 '15 at 20:09
  • When you use an advanced regex engine, there is no need to use subpatterns designed for posix like:`(([^#] | #[^0] | #0[^1] | #01[^1] )+) #011` or this inefficient workaround: `((?: (?!#011). )+)#011` (note the number of useless tests with this way). You can use something like this: `[^#]*+(?:#(?!011)[^#]*)*+` that is from far more performant. However the first way can be useful and is good to know for mysql, sed or grep. – Casimir et Hippolyte May 15 '15 at 21:56
  • @CasimiretHippolyte I can't see the number of useless tests, can you elaborate? What I see is "if not looking at a hash sign, then eat a char, else do something more complicated", which should be pretty fast. +++ I'm also lost concerning the far more performant solution as I'd go for my very first regex instead (my reason for the negation section was actually assuming I can't use atomic groups). – maaartinus May 15 '15 at 22:05
  • @maaartinus: Take a look here: https://regex101.com/r/fN8xN8/1 (see the versions 1 to 4, look at the number of steps and use the debugger to see what happens). Obviously the number of steps is only indicative but if you benchmark these patterns with large strings (or repeated a lot of times) you will see a real difference. – Casimir et Hippolyte May 15 '15 at 22:24
  • @CasimiretHippolyte Nice site, indeed! Actually, I knew the trick you used (`simple*+ (complex simple*+)*`), but I hoped, my simple solution could be competitive. Concerning the number of steps, you win by a factor of nearly 2 and that's not bad! – maaartinus May 15 '15 at 23:22
  • The possessive quantifiers `*+` are only here to prevent backtracking when there is not `#011` after, but the main idea is to limit the use of the lookahead (only when it is needed, so only after a `#`). – Casimir et Hippolyte May 15 '15 at 23:40