2

I'm trying to parse a JSON object out of text with a Python regex. I found this pattern:

'\{(?:[^{}]|(?R))*\}'

but in python I get this error:

re.error: unknown extension ?R at position 12

See the regex match in this regex101 example.

Martijn Pieters
Stephan
  • That's because that's not syntax supported by the Python regex parser. There is no universal regex syntax standard supported by all engines. – Martijn Pieters Jan 17 '19 at 12:04
  • 1
    Bottom line: don't just copy regex patterns from random locations and expect them to work in random regex engines. At a minimum [educate yourself a little about regex](https://www.regular-expressions.info/). Recursion is supported in [Perl, Ruby and languages that use the PCRE library](https://www.regular-expressions.info/recurse.html); other languages need to use 3rd-party libraries, if available. – Martijn Pieters Jan 17 '19 at 12:09

3 Answers

37

You found a regex that uses syntax that the Python standard library's re module doesn't support.

When you look at the regex101 link, you'll see that the pattern works when using the PCRE library. The (?R) syntax that throws the error uses a feature called recursion, and that feature is only supported by a subset of regex engines.
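As a quick check, the sketch below tests whether the standard library engine accepts a given pattern. The helper name `compiles_with_re` is made up for illustration; it only wraps `re.compile` and catches `re.error`.

```python
import re

PATTERN = r'\{(?:[^{}]|(?R))*\}'

def compiles_with_re(pattern):
    """Return True if the stdlib re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # Unsupported constructs such as (?R) are rejected at compile time.
        return False
    return True

print(compiles_with_re(PATTERN))        # the recursive pattern from the question
print(compiles_with_re(r'\{[^{}]*\}'))  # a non-recursive pattern for comparison
```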

You could install the regex library, an alternative regex engine for Python that explicitly does support that syntax:

>>> import regex
>>> pattern = regex.compile(r'\{(?:[^{}]|(?R))*\}')
>>> pattern.findall('''\
... This is a funny text about stuff,
... look at this product {"action":"product","options":{...}}.
... More Text is to come and another JSON string
... {"action":"review","options":{...}}
... ''')
['{"action":"product","options":{...}}', '{"action":"review","options":{...}}']

Another option is to just try and decode any section that starts with { using the JSONDecoder.raw_decode() method; see How do I use the 'json' module to read in one JSON object at a time? for an example approach. While the recursive regex can find JSON-like text, the decoder approach would let you extract only valid JSON text.

Here is a generator function that does just that:

from json import JSONDecoder

def extract_json_objects(text, decoder=JSONDecoder()):
    """Find JSON objects in text, and yield the decoded JSON data

    Does not attempt to look for JSON arrays, text, or other JSON types outside
    of a parent JSON object.

    """
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1

Demo:

>>> demo_text = """\
... This is a funny text about stuff,
... look at this product {"action":"product","options":{"foo": "bar"}}.
... More Text is to come and another JSON string, neatly delimited by "{" and "}" characters:
... {"action":"review","options":{"spam": ["ham", "vikings", "eggs", "spam"]}}
... """
>>> for result in extract_json_objects(demo_text):
...     print(result)
...
{'action': 'product', 'options': {'foo': 'bar'}}
{'action': 'review', 'options': {'spam': ['ham', 'vikings', 'eggs', 'spam']}}
Martijn Pieters
  • If it's useful, I Frankenstein'ed `extract_json_objects()` to also provide the surrounding text, so that a user could choose to json-prettify the objects along with the string which contained it. [Adam can be found here](https://stackoverflow.com/a/61384796/1431750). – aneroid Apr 23 '20 at 10:37
3

If there is only one JSON object per line, you can use the string index methods to find the first opening brace and the last closing brace, then slice out the JSON:

firstValue = jsonString.index("{")
lastValue = jsonString.rindex("}") + 1
jsonString = jsonString[firstValue:lastValue]
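A minimal sketch of the same idea wrapped in a function, assuming the line really does contain exactly one object. The name `extract_single_json` is hypothetical; the slice is handed to `json.loads` so that anything that is not valid JSON raises an error instead of passing through silently.

```python
import json

def extract_single_json(line):
    """Decode the text spanning the first '{' to the last '}' in line.

    Raises ValueError if a brace is missing or the slice is not valid JSON.
    """
    start = line.index("{")       # raises ValueError if there is no '{'
    end = line.rindex("}") + 1    # raises ValueError if there is no '}'
    # json.JSONDecodeError is a subclass of ValueError
    return json.loads(line[start:end])

print(extract_single_json('log entry: {"action": "review", "options": {"id": 1}} done'))
```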
0

That is because Python's re module is fairly limited and does not support subroutines or recursion. Try the regex module from PyPI instead; it will compile your regex.
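If installing a third-party package is not an option, nesting can also be tracked by hand. The sketch below is not a regex at all but a plain depth counter; the function name `find_balanced_braces` is made up for illustration. It assumes the braces in the text are balanced and, like the recursive pattern, it knows nothing about JSON strings, so a brace inside a quoted value will confuse it.

```python
def find_balanced_braces(text):
    """Yield each outermost {...} span in text by counting nesting depth."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i  # remember where the outermost span begins
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

for span in find_balanced_braces('x {"a":{"b":1}} y {"c":2}'):
    print(span)
```

The spans it yields are only candidates; pass each one to json.loads if you need to be sure it is valid JSON.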

Superluminal