Regex optimization question

Question

As a personal learning exercise, I wrote this regex to split a unary string into parts whose length is increasing powers of two (see also on ideone.com):

    for (String s :
       new String(new char[500])
       .split("(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)")
    ) {
       System.out.printf("%s ", s.length());
    }
    // prints "1 2 4 8 16 32 64 128 245 "

This uses a combination of capturing during lookarounds, nested lookarounds, matching on backreferences, and infinite length lookbehind (which isn't officially supported in Java but works anyway). Properties of sums of powers of two and the fact that the string has a unary alphabet is also taken advantage of.

This solution is both unreadable and has a horrible performance.

My question is, how would you "optimize" this regex?

Can you make the regex more readable (it's okay if it performs worse)
Can you make the regex performs better (it's okay if it's less readable)

I consider playing with regexes to be fun, but this is totally masochistic — Amarghosh, Jul 07 '10 at 09:41
@Amargosh: it was frustratingly painful to write, until I got it to work. Then it became hedonistic. — polygenelubricants, Jul 07 '10 at 09:58
How horrible is its performance on Java? On .NET, it splits a 10k letter string within 4 seconds. — Jens, Jul 07 '10 at 10:13
@Jens: my Java solution splits 666 in 4 seconds (http://ideone.com/LCnQO). With e.g. 800, it exceeds time limit allowance on ideone.com. I'm beginning to think that I need to switch to .NET to play with regex. — polygenelubricants, Jul 07 '10 at 11:02
Trying to decipher this... can you explain the `\2\2.\1` part? Since the \2 is anchored to start, what does the doubling do? — Peter Boughton, Jul 07 '10 at 11:03
@Peter: `^\2\2.\1$`, with `\1` = `\G.*$` and `\2` = `^.*\G`, so the `\2.` middle part accomplishes the same as Jens' `(?<=\1.)`. The doubling and plus one uses identities of summation of powers of two. And yes, I do plan to explain my solution eventually. I purposedly didn't because I want others to try something entirely new if possible. I'm just providing it here so that people see this is possible at least one way. — polygenelubricants, Jul 07 '10 at 11:07
@Peter: example of arithmetic involved: say we already split into parts of length 1,2,4. Then, 1+2+4=7. The next part must have length 8=7+1. That's 7+7+1 away from the beginning. — polygenelubricants, Jul 07 '10 at 11:13
Ah, thanks, that makes sense now. :) I'll have a think and see if I can come up with another way, though I suspect I'll fail. — Peter Boughton, Jul 07 '10 at 11:28
@polygenelubricants: Yeah, for playing with regex .NET is probably more satisfying. I often see regex questions here on SO and think "Well, thats easy using .NET... now for your hands-tied-behind-your-back-regex-implementation..." =) — Jens, Jul 07 '10 at 13:04
Let's see: unrestricted lookbehinds; reusable group names; intermediate captures; right-to-left matching; depth-tracking recursive subexpressions... Yep, I'll expect to see a proof of Fermat's Last Theorem in regex form by the end of the week. ;) — Alan Moore, Jul 08 '10 at 12:34

Jens · Answer 1 · 2010-07-07T10:46:21.993

I am not a Java guy, therefore my answers is based on the .NET Regex implementation. I used

"(?<=(^.*)\\G.*)(?<=\\G\\1.)"

based on the fact that \sum_{i=0}^{n} 2^n = 2^{n+1} - 1. It basically reads "Match every position, for which the part after the last match is one longer than the part before the last match."

Its about twice as fast as your original (on .NET, again), taking less than 2 seconds to split 10.000 characters, and I'd argue that its a little more readable. Well... less unreadable. =)

Cheers! Nice question! =)

Edit: Looking at your Regex again, it seems to me that you are using the same approach, but in a more complicated manner. I admit that I have not tried to read yours before trying to make my own solution, both because I like a challenge and because your regex is quite unreadable. =) Are these nested lookarounds neccessary because of the Java regex engine?

Yep, this is exactly the same matching/arithmetic logic that I have, but more concise because .NET officially supports infinite lookbehind. This doesn't compile in Java (http://ideone.com/wcDRQ). +1 for effort, though. Java also doesn't support backreferences in lookbehind (see: http://stackoverflow.com/questions/2734977/backreferences-in-lookbehind), but you can use it in a lookahead nested inside a lookbehind. See? NOW we're all learning!! =) — polygenelubricants, Jul 07 '10 at 10:35

score 1 · Answer 2 · answered Jul 07 '10 at 09:31

1

I really wouldn't. I'd throw the whole thing away and redo it as readable procedural code.

There are some things you really shouldn't do with regular expressions. This is one of them. I'm all for educating yourself but do you really think that's going to come in handy at some point?

Perhaps you'd be better off learning something that will actually be usable and maintainable :-)

answered Jul 07 '10 at 09:31

paxdiablo

854,327
234
1,573
1,953

2

Should be a comment, but programming isn't all about making money, it's also about having fun. Learning is fun for me. – polygenelubricants Jul 07 '10 at 09:33
And what did you learn today? I learnt (or more correctly, re-enforced my belief) that REs are a good tool but they're not fit for everything. – paxdiablo Jul 07 '10 at 09:35
Hopefully I'll learn from answers, not comments. I'm hoping Alan Moore will come and save the day. Or perhaps I'll have my own breakthrough. – polygenelubricants Jul 07 '10 at 09:37
2

What *I've* learned so far is that I haven't had enough coffee yet. :D – Alan Moore Jul 07 '10 at 15:06

polygenelubricants · Answer 3 · 2010-07-07T11:34:47.197

These are patterns that have worked for me in Java. I'll eventually revise everything into one comprehensive answer with FULL explanations. These are all Java string literal representations.

P000: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)"
- Original pattern
- 500 in 2.23s (ideone.com), 666 in 4.1s (ideone.com)
P001: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2.\\1)\\G.*)"
- Essentially the same as P000, but instead of ^\2\G\2.\1, we cut off the head to just \G\2.\1
- 500 in 2.21s (ideone.com)
P002: "(?=(.*))(?<=(?<=(^.*))(?=\\2.\\1)\\G.*)"
- Much slower than P000 but shorter
- Essentially a "refactoring" of P001 now that both lookbehinds are anchored at \G
- 400 in 3.05s (ideone.com)

Regex optimization question

3 Answers3