regex backtracking entire word

Question

I have a string that is a list of key value pairs that looks like:

Key=key1,Value=value1 Key=key2,Value=value2

Some of my values have braces around them, so I could have something like:

Key=key1,Value={"a":"b"}, Key=key2,Value=value2, Key=key3,Value={"c":{"d":"e"}}

I want to create a regex to match just the values that are in braces. The regex I currently have is {[^=]*} , and that works if none of the values have = in them.

This will break it:

Key=key1,Value={"a=":"b"}, Key=key2,Value=value2, Key=key3,Value={"c":{"d":"e"}}

I tried changing my regex to be {[^(Key=)]*} , but that did not match it.

If I can assume Key= is the start of a new key and will not appear in regex values, how can I modify my regex to match this?

Because you have nested braces you've now gone into parser land. Regex isn't well-suited for locating nested things. — Kenneth K., Dec 24 '15 at 17:37
In addition to what @KennethK. said, this is why people shouldn't come up with custom storage formats. Use JSON or something standard. Then you don't have these problems. — jpmc26, Dec 24 '15 at 19:09
@Kenneth K. Regex(in terms of programming) can produce CFG solutions, therefore its not true that its not well-suited. Please check references in my answer — Ferit, Dec 25 '15 at 09:12
@Saibot Not exactly. You're making use of a construct ("recursive patterns") that does not exist in all regex flavors. It appears that you can only use that syntax if you install an additional library in Python (what the OP tagged the question with). — Kenneth K., Dec 27 '15 at 03:17
@Kenneth K. I didn't claim that all regex flavor can do CFG's, but there are regex flavors which can do it(PCRE for example). So your answer doesn't refute my claim. — Ferit, Dec 27 '15 at 03:22
`Regex(in terms of programming) can produce CFG solutions` is a general statement. For the inexperienced, your comment makes it sound like, "here do this. it will always work." Your claim is not properly framed. — Kenneth K., Dec 27 '15 at 03:23
@KennethK. That's your comment. I didn't claim every flavor is handling CFG'S. Is it really that hard to accept you were wrong while saying regexes don't work? Is it really that hard to accept my answer is better? Please, get over it. — Ferit, Dec 27 '15 at 22:46
@Jeff Storey Hello, did you check my answer with regex? I think it's a better solution. — Ferit, Dec 27 '15 at 22:48
@jpmc26 I'm not trying to come up with a custom storage format. The Value will always be either a plain string or JSON, I'm just trying to pull out the JSON from a key value string. Unfortunately I don't have much control over that string coming in. — Jeff Storey, Dec 30 '15 at 21:21
Sorry. I meant someone else did. Whoever formatted it that way should have used JSON or similar. — jpmc26, Dec 30 '15 at 21:43

Amal Murali · Accepted Answer · 2015-12-24T15:52:07.443

3

The problem with your current regex is that you can't peform negations with strings inside a character class. It's a character class. So this won't work as you would expect: {[^(Key=)]*} - it matches any string that contains characters that are not (, K, e, y =, ), zero or more times, but you want it to match any strings that are not Key= instead.

You can use a different approach with recursion to accomplish what you need:

{(([^{}]|(?R))*)}

Demo

edited Dec 24 '15 at 15:52

answered Dec 24 '15 at 15:34

Amal Murali

75,622
18
128
150

That seems to leave off the last } in the nested expression. Will making the last } a }+ fix the issue? – Jeff Storey Dec 24 '15 at 15:36
@JeffStorey: I don't think Python supports recursive regex, that's probably why you saw an error. You need to use a package like `regex`. If all keys follow the same structure, I don't see why you need to specifically look for `Key=`. Why not just take everything inside the outermost curly braces? – Amal Murali Dec 24 '15 at 15:53
Thanks for the explanation on the Negation. Let me test out some of the recursive regex. – Jeff Storey Dec 30 '15 at 21:23

jpmc26 · Answer 2 · 2015-12-24T19:06:40.487

Forget regex. Accomplishing what you want to do with a regex is going to be error prone and unreliable. You're always going to have little edge cases that you can't really handle well with a regex.

What you really need is a context free grammar. Use pyparsing.

>>> from pyparsing import OneOrMore, Regex, Optional
>>> pairListParser = OneOrMore(u'Key=' + Regex(u'[^,]+') + u',Value=' + Regex(u'[^, ]+') + Optional(Regex(u',? ')))
>>> x = u'Key=key1,Value={"a=":"b"}, Key=key2,Value=value2, Key=key3,Value={"c":{"d":"e"}}'
>>> pairListParser.parseString(x, parseAll=True)
([u'Key=', u'key1', u',Value=', u'{"a=":"b"}', u', ', u'Key=', u'key2', u',Value=', u'value2', u', ', u'Key=', u'key3', u',Value=', u'{"c":{"d":"e"}}'], {}

Note that in the example above, I assumed that keys cannot contain a comma (,) and that values cannot contain a comma (,) or a space (). I did so for simplicity, but with pyparsing, you can rework the parser to allow for those cases. It's just a matter of doing the work to figure it out, whereas with regex, it is mathematically impossible to parse it if those restrictions don't apply.

Then you just need to pull out the results.

>>> parsedX = pairListParser.parseString(x, parseAll=True)
>>> parsedXIter = iter(i for i in parsedX if i not in (u'Key=', u',Value=', u', '))
>>> result = dict(zip(parsedXIter, parsedXIter))
>>> result
{u'key3': u'{"c":{"d":"e"}}', u'key2': u'value2', u'key1': u'{"a=":"b"}'}

(There are probably better ways to pull out the results, but this was quick and dirty. Noteably, pyparsing has capabilities that let you discard certain elements or transform the results while it parses.)

Once you have the results in a dict, you can do whatever you want with the values:

for k, v in result.items():
     m = re.match(u'^{(.+)}$', v)
     if m:
         print(m.groups())

I imagine it would be better to parse them as JSON or something like that, but the point is now you've cut off all the stuff around the value and can work with just the value in isolation.

Regexes(in terms of programming) cover CFG's as well, so you can write CFG solution in regex(again in terms of programming) — Ferit, Dec 24 '15 at 19:47
@Saibot As defined mathematically, they do not. It's true that with the extensions most programming languages have nowadays (like back references), they cover more than just regular languages, but I'm not convinced they cover the entirety of context free languages. Even if they do, regexes that complex are more likely to make your head spin than to help. If you need a context free grammar, just use one. It simplifies the problem (define each token, define how tokens fit together) and results in much more comprehensible code when you have excellent libraries like `pyparsing`. — jpmc26, Dec 24 '15 at 20:25
I hope these references convince you about my claim: Regexes(in terms of programming) covers CFG's Please see this: http://stackoverflow.com/questions/3644266/how-can-we-match-an-bn-with-java-regex/3644267#3644267 and https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html — Ferit, Dec 24 '15 at 21:24

score 2 · Answer 3 · edited May 23 '17 at 12:33

2

Simply use the regex below

Value=({[^{}]*(?1)?})

Demo: https://regex101.com/r/pJ8lO9/2

Explanation:

You need a CFG, and you can get a CFG solution using regex(in terms of programming)

For further reading about this claim please check: How can we match a^n b^n with Java regex? and https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html

As this pattern needs to match balanced curly bracket, which forms:

a^n b^n

And as n is arbitrary, regex(in terms of maths) cannot solve this. We need a CFG. And the regex(in terms of programming) solution for this is:

(a(?1)?b)

This is a recursive pattern. '(?1)' recurses first capturing group: '(a(?1)?b)'. And the '?' is to avoid infinite recursion. '(a(?1)b)' would recurse infinitely. So '(?1)' has two options, '(a(?1)?b)', or empty. In CFG notation, it's represented as:

(?1) -> a(?1)b | ε

Back to our solution. 'a' repsesents '{' and 'b' represents '}', so

({(?1)?})

and we need to put values inside brackets:

({[^{}]*(?1)?})

and decorating it with 'Value='

Value=({[^{}]*(?1)?})

edited May 23 '17 at 12:33

Community

1
1

answered Dec 24 '15 at 21:32

Ferit

8,692
8
34
59

For the regex, this matches this particular example, but if I add more nesting to it, like `Key=key1,Value={"a=":"b"}, Key=key2,Value=value2, Key=key3,Value={"c":{"d":"e"},"d":{"e":"f"}}`, it no longer grabs the value for key3 – Jeff Storey Dec 30 '15 at 21:20
You want to match 'Value={"c":{"d":"e"},"d":{"e":"f"}}' right? – Ferit Dec 30 '15 at 22:11
Yes, that is correct, but there will not be single quotes around Value (out of my control). It looks like the regex that @amalmurali provided works for all of these cases. – Jeff Storey Dec 31 '15 at 15:59
@JeffStorey I didn't know that you would have these type of inputs. After your comment, I worked out, and my result was the same with Amal Murali. His answer is probably shortest and easiest possible, I couldn't find any better. But anyway, I think my explanation about CFG grammars, and how to build these types of regexes is enough contribution to not to delete the answer, what do you think? :) – Ferit Jan 02 '16 at 05:07

regex backtracking entire word

3 Answers3