
Shortened URL with my current regex in RegexPal: http://bit.ly/1jbOFGd

I have a line of key=value pairs, space delimited. Some values contain spaces and punctuation so I do a positive lookahead to check for the existence of another key.

I want to tokenize the key and value, which I later convert to a dict in Python.

My guess is that I can speed this up by getting rid of .*? but how? In Python I convert 10,000 of these lines in 4.3 seconds. I'd like to make that two or three times faster by making this regex match more efficient.

Jared

2 Answers


Update:

(?<=\s|\A)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))

I would think this one is more efficient than yours: it still uses .*? for the value, but its lookahead is nowhere near as complex and doesn't use a lazy modifier. I'll need you to test it, though. This does the same as my original expression, but handles values differently. It uses a lazy .*? match followed by a lookahead that is either a whitespace character, followed by a key, followed by a =, OR the end of the string. Notice I always define a key as [^\s=]+, so keys cannot contain an equals sign or whitespace (being this specific helps us avoid lazy matches).

Source
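
For anyone trying this in Python: the built-in re module rejects a lookbehind whose alternatives have different widths, and (?<=\s|\A) mixes a one-character \s with a zero-width \A. Below is a minimal sketch of one way around that, assuming the negative lookbehind (?<!\S) ("not preceded by a non-whitespace character") as a stand-in. The name parse_line and the sample line are made up for illustration, and this is not necessarily the modification the asker ended up with.

```python
import re

# Assumption: (?<!\S) stands in for (?<=\s|\A). It is fixed-width, so Python's
# re accepts it, and it succeeds at the start of the string or after whitespace.
PAIR = re.compile(r"(?<!\S)([^\s=]+)=(.*?)(?=\s[^\s=]+=|$)")


def parse_line(line):
    """Return a dict of key/value pairs from one space-delimited line."""
    # findall yields (key, value) tuples because the pattern has two groups.
    return dict(PAIR.findall(line))


# Made-up sample: values contain spaces and punctuation, keys do not.
sample = "name=John Smith city=New York, NY note=hello, world!"
print(parse_line(sample))
# {'name': 'John Smith', 'city': 'New York, NY', 'note': 'hello, world!'}
```

Compiling the pattern once outside the per-line loop keeps the parsing loop itself small, which may matter when processing many thousands of lines.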


Original:

Are there some rules I am missing that you need by doing something this simple?

(?<=\s|\A)([^=]+)=([\S]+)

This starts with a lookbehind for either a whitespace character (\s) or the beginning of the string (\A). Then we match one or more characters that are not = as the key, followed by a =, and then one or more non-whitespace characters (\S) as the value.
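
A small sketch of how this simpler pattern behaves in Python, assuming (?<!\S) in place of (?<=\s|\A) so that it compiles under the re module. The sample strings are made up. It works when values have no spaces, and the second call shows what goes wrong once a value does contain one.

```python
import re

# Assumption: (?<!\S) replaces (?<=\s|\A) so the pattern compiles in Python's re.
SIMPLE = re.compile(r"(?<!\S)([^=]+)=(\S+)")

# Values without spaces tokenize as expected.
print(SIMPLE.findall("a=1 b=2"))
# [('a', '1'), ('b', '2')]

# A value with a space is cut short, and the leftover word leaks into the next key.
print(SIMPLE.findall("name=John Smith city=NY"))
# [('name', 'John'), ('Smith city', 'NY')]
```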

Sam
  • Just noticed that you can have spaces in your values (not sure why you would allow this with space delimiters), let me work on my answer. I'm assuming there at least can't be spaces in the keys. – Sam Apr 04 '14 at 20:36
  • many thanks for what you've provided thus far! yes, spaces and punctuation in the values, cannot help that :( however, never any spaces in the keys, so that's good. – Jared Apr 04 '14 at 23:52
  • @jared sorry to ask a possibly redundant question, but is everything working alright? Spaces and punctuation can be in values (not keys, they can be anything but whitespace and `=`) with the updated answer. Can you test that? – Sam Apr 05 '14 at 00:01
  • trying to get it working. getting a python error "look-behind requires fixed-width pattern" so troubleshooting that now. btw, I LOVE regex101, great find. – Jared Apr 05 '14 at 00:16
  • That's weird... the lookbehind is fixed (either `\s|\A`), the lookahead isn't fixed but that shouldn't matter. – Sam Apr 05 '14 at 00:18
  • ok, so the \s|\A in the beginning is the culprit. I guess it could be 0 or 1. I'm getting faster results with your regex. Here's my modification: http://regex101.com/r/nH6wJ0 – Jared Apr 07 '14 at 14:05
  • I'm down to 1.5s to process 10,000 lines. Adding threading/multiprocessing will reduce it from there. Thanks for your help Sam, you rock. – Jared Apr 07 '14 at 14:32
  • It may be worth reading an article on how [regex performance](http://blog.codinghorror.com/regex-performance/) can be killed by *catastrophic backtracking* and what you can do. Please select my answer if it helped :) – Sam Apr 07 '14 at 14:36

"Lookbehind" (related to "lookahead" and "lookaround") is the key regular expression concept to read up on here. It lets you require that certain text appears before or after a match without including that text in the match itself.

Good examples here: http://www.rexegg.com/regex-lookarounds.html.
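
As a quick illustration in Python (the sample strings are made up), a lookbehind or lookahead checks the surrounding text but leaves it out of the result:

```python
import re

# Lookbehind: digits that directly follow a "$"; the "$" is not part of the match.
prices = re.findall(r"(?<=\$)\d+", "cost $42 and 17 items")
print(prices)  # ['42']

# Lookahead: words that are directly followed by "="; the "=" is not consumed.
keys = re.findall(r"\w+(?==)", "a=1 b=2")
print(keys)  # ['a', 'b']
```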

Mosca Pt