Optimize a Regex

Question

I'm using the following code to discard unsupported physical interfaces / subinterfaces from routers that connects to a big ISP network (by big I mean tens of thousands of routers):

private final static Pattern INTERFACES_TO_FILTER = 
   Pattern.compile("unrouted VLAN|GigabitEthernet.+-mpls layer|FastEthernet.+-802\\.1Q vLAN subif"); 

// Simplification
List<String> interfaces;
// lots of irrelevant code to query the routers 

for (String intf : interfaces) {
   if (INTERFACES_TO_FILTER.matcher(intf).find()) {
      // code to prevent the interface from being used
   } 
}

The idea is discarding entries such as:

unrouted VLAN 2000 for GigabitEthernet2/11.2000
GigabitEthernet1/2-mpls layer
FastEthernet6/0/3.2000-802.1Q vLAN subif

This code is hit often enough (several times per minute) over huge sets of interfaces (some routers have 50k+ subintefaces), cache doesn't really help much either because new subinterfaces are being configured / discarded very often. The plan is to optimize the regex so that the procedure completes a tad faster (every nanosecond counts). Can you guys enlighten me?

Note: mpls layer and 802.1Q are supported for other kinds of interfaces, unrouted VLANs isn't.

why not first isolate the string that should be matched and check that — ratchet freak, Dec 29 '11 at 18:29
Have you tried just using String.contains rather than the regex? — JB Nizet, Dec 29 '11 at 18:30
Those entries that you've provided are full lines or some words taken from the middle of line? — piotrekkr, Dec 29 '11 at 18:38
@ratchet, actually I'm doing this, I will update the code to reflect the fact, but this 'if' is inside a 'for' iterating over each interface. — Anthony Accioly, Dec 29 '11 at 18:38
@JB, Would it be worth discarding a single regex for 5 contains? `(contains 'unrouted vlan' || (contains 'GigabitEthernet' && constains '-mpls layer') || (contains 'FastEthernet' && 'vLAN subif')`? — Anthony Accioly, Dec 29 '11 at 18:41
I don't know, but it's easy enough to test if it's worth or not. You could also use indexOf and only test that "-mpls ayer" is found after the index of 'GigabitEthernet' + 15 (15 being the length of 'GigabitEthernet') — JB Nizet, Dec 29 '11 at 18:43
then wouldn't `matches()` work as well this doesn't try at every position (only 1 match is attempted instead of `intf.length()`) — ratchet freak, Dec 29 '11 at 18:54
`Pattern.matches` was actually slower than `matcher().find`. The only significant difference is that for matches i used `unrouted VLAN.+`, I don't see the reason for it to be slower than find, but it actually is (about 2 times slower). — Anthony Accioly, Dec 29 '11 at 19:13

Voo · Answer 1 · 2011-12-29T19:09:27.873

2

There are some string search algorithms that allow you to try to search in a string of length n for k strings at once cheaper than the obvious O(n*k) cost.

They usually compare a rolling hash against a list of existing hashes of your words. A prime example of this would be the Rabin-Karp algorithm. The wiki page even has a section about this. There are more advanced versions of the principle out there as well, but it's easy to understand the principle.

No idea if there already are libraries in Java that do this (I'd think so), but that's what I'd try - although 5 strings is rather small here (and different size makes it more complex too). So better check whether a good KMP string search isn't faster - I'd think that'd be by far the best solution really (the default java api uses a naive string search, so use a lib)

About your regexes: backtracking regex implementation for performance critical search code? I doubt that's a good idea.

PS: If you'd post a testset and a test harness for your problem, chances are good people would see how much they could beat the favorite - has worked before.. human nature is so easy to trick :)

edited Dec 29 '11 at 19:09

answered Dec 29 '11 at 18:54

Voo

29,040
11
82
156

Thanks @Voo. I will dig into it. As for the test harness, not sure what I can do, maybe iterate 5.000.000 times over those tree Strings and record time with `System.nanotime()`:D. I'm not good with benchmarks lol. If i don't feel satisfied with it until tomorrow I will update the code with the test lol :D. – Anthony Accioly Dec 29 '11 at 19:18
@Anthony Yeah writing benchmarks in java is not easy. [caliper](http://code.google.com/p/caliper/) ought to make it easier, but I haven't used it. [This](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) and one of Cliff's [presentations](http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf) should help you if you want a sensible benchmark. – Voo Dec 29 '11 at 20:09
For the harness in general I'd propose: An interface `boolean valid(String s)` that is implemented by the test candidates and you then just check some typical input strings (and check the result!). Simplest is to read the input from a file you provide seperately - should be representable of your usual data after all so we can't just generate random strings there. In general: **Never optimize something you can't measure** so this really ought to be the first thing you do.. don't worry we'll point out some problems in your first version if you post it ;-) – Voo Dec 29 '11 at 20:14
Hahahha... That's why I don't write benchmarks... I'm using a profiler to measure my own code :D. – Anthony Accioly Dec 29 '11 at 20:25
@Anthony That works too, although in my experience comparing runs of larger applications to each other is rather problematic - ie same code, same data will run 20% faster or slower from one call to the other - that's the problem with JITing, OSR and so on. But for large differences using the profiler should give a somewhat accurate picture yes. – Voo Dec 30 '11 at 01:07

score 2 · Accepted Answer · answered Dec 29 '11 at 21:14

I'm answering my own question for further reference, although the credits goes to @piotrekkr since he was the one that pointed the way. Also my Kudos to @JB and @ratchet. I ended up using matches(), and the logic using indexOf and several contains was almost as fast (that's news to me, I always assumed that a single regex would be faster than several calls to contains).

Here's a solution that is several times faster (according to the profiler, about 7 times less time is spent at Matcher class methods):

^(?:unrouted VLAN.++|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)$

score 1 · Answer 3 · answered Dec 29 '11 at 19:13

1

If your problem is that you have a number of long string constants you're searching for, i would recommend using a Java analog of the standard C tool "lex".

A quick googling took me to JFlex. I haven't used this particular tool and there may be others available, but that is an example of the kind of tool i would look for.

answered Dec 29 '11 at 19:13

theglauber

28,367
7
29
47

Well lex would just create a FSM for him in the end and well.. the regex is doing that already.. – Voo Dec 29 '11 at 20:07
Good point. I would like to use a lex tool for 2 reasons: (1) I would hope the tool would do a better job of optimizing the FSM than a regexp (because it can spend more time doing the optimization). (2) The lex file gives me a more readable way to organize my search strings, compared to a long regex with many |s. – theglauber Dec 29 '11 at 20:57

piotrekkr · Answer 4 · 2011-12-29T19:36:10.190

1

If you must use regex for this try changing to this one:

^(?:unrouted VLAN)|(?:GigabitEthernet.+?-mpls layer)|(?:FastEthernet.+?-802\.1Q vLAN subif)

^ make engine match from begining of string, not anywhere in string

.+? makes + ungreedy

(?:...) makes () non-capturing group

edited Dec 29 '11 at 19:36

answered Dec 29 '11 at 19:22

piotrekkr

2,785
2
21
35

Optimize a Regex

4 Answers4