I'm trying to parse a JSON object out of text with a Python regex. I found this pattern:
'\{(?:[^{}]|(?R))*\}'
but in Python I get this error:
re.error: unknown extension ?R at position 12
See the regex match in this regex101 example.
You found a regex that uses syntax that Python's standard library re
module doesn't support.
When you look at the regex101 link, you'll see that the pattern works when using the PCRE library. The (?R)
syntax that triggers the error is a feature called recursion, which only a subset of regex engines support.
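A minimal sketch to confirm this using only the standard library: compiling the pattern with re raises re.error, which can be caught. The helper name compiles_with_re is hypothetical and only serves the illustration.

```python
import re

RECURSIVE_PATTERN = r'\{(?:[^{}]|(?R))*\}'


def compiles_with_re(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for syntax that re does not understand, such as (?R)
        return False
    return True


print(compiles_with_re(RECURSIVE_PATTERN))  # the recursive pattern from the question
print(compiles_with_re(r'\{[^{}]*\}'))      # non-recursive variant, flat braces only
```

The non-recursive variant compiles, but it only matches braces that contain no nested braces, so it is not a substitute for the recursive pattern.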
You could install the regex
library, an alternative regex engine for Python that explicitly does support that syntax:
>>> import regex
>>> pattern = regex.compile(r'\{(?:[^{}]|(?R))*\}')
>>> pattern.findall('''\
... This is a funny text about stuff,
... look at this product {"action":"product","options":{...}}.
... More Text is to come and another JSON string
... {"action":"review","options":{...}}
... ''')
['{"action":"product","options":{...}}', '{"action":"review","options":{...}}']
Another option is to just try and decode any section that starts with {
using the JSONDecoder.raw_decode()
method; see How do I use the 'json' module to read in one JSON object at a time? for an example approach. While the recursive regex can find JSON-like text, the decoder approach would let you extract only valid JSON text.
Here is a generator function that does just that:
from json import JSONDecoder

def extract_json_objects(text, decoder=JSONDecoder()):
    """Find JSON objects in text, and yield the decoded JSON data

    Does not attempt to look for JSON arrays, text, or other JSON types outside
    of a parent JSON object.

    """
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1
Demo:
>>> demo_text = """\
This is a funny text about stuff,
look at this product {"action":"product","options":{"foo": "bar"}}.
More Text is to come and another JSON string, neatly delimited by "{" and "}" characters:
{"action":"review","options":{"spam": ["ham", "vikings", "eggs", "spam"]}}
"""
>>> for result in extract_json_objects(demo_text):
...     print(result)
...
{'action': 'product', 'options': {'foo': 'bar'}}
{'action': 'review', 'options': {'spam': ['ham', 'vikings', 'eggs', 'spam']}}
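To see what the generator relies on, here is a small sketch of raw_decode() on its own. It assumes the string starts with a JSON document; the sample text is made up for illustration.

```python
from json import JSONDecoder

decoder = JSONDecoder()
text = '{"action": "product"} trailing text'

# raw_decode parses one JSON document at the start of the string and
# returns the decoded value plus the index where parsing stopped.
obj, end = decoder.raw_decode(text)
rest = text[end:]

# If the string does not start with valid JSON, a ValueError
# (json.JSONDecodeError) is raised instead.
try:
    decoder.raw_decode('{not json}')
    failed = False
except ValueError:
    failed = True
```

Unlike json.loads(), raw_decode() tolerates trailing data after the document, which is why the generator can advance by the returned index and keep scanning.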
If there is only one JSON object on a line, you can use the index methods to find the first opening and the last closing brace and slice out the JSON:
firstValue = jsonString.index("{")
lastValue = jsonString.rindex("}") + 1
jsonString = jsonString[firstValue:lastValue]
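A runnable sketch of the same idea, with the slice handed to json.loads(). The function name extract_single_json and the sample line are hypothetical, and the approach assumes no stray braces appear outside the JSON object.

```python
import json


def extract_single_json(line):
    """Decode the text between the first '{' and the last '}' in line."""
    first = line.index("{")       # raises ValueError if there is no '{'
    last = line.rindex("}") + 1   # raises ValueError if there is no '}'
    return json.loads(line[first:last])


line = 'log entry: {"action": "review", "options": {"spam": 1}} done'
print(extract_single_json(line))
```

If the line holds more than one object, or a brace appears in the surrounding text, the slice will not be valid JSON and json.loads() raises an error.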
That is because Python's re
module is fairly limited and does not support subroutines or recursion. Try the regex
module from PyPI instead. It will compile your regex.