
I am trying to retrieve all JSON-like dictionaries from a long string. For example:

{"uri": "something"} is referencing {"link": "www.aurl.com"}

I want to get `{"uri": "something"}` and `{"link": "www.aurl.com"}` as the result. Is there a way to do this with a regex in Python?

ShadowRanger
Jie Zhang
  • `re.findall(r'\{[^}]*\}', s)` – Avinash Raj Sep 15 '15 at 16:05
  • @AvinashRaj: The flaw with that approach is that it can't handle nested objects. If the string was `{"uri": {"domain": "example.com", "protocol": "https"}, "foo": "bar"} is referencing {"link": "www.aurl.com"}`, your first capture would omit the `, "foo": "bar"}`, leaving you unparseable partial JSON as a result. – ShadowRanger Sep 15 '15 at 16:21
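
To make the flaw described in the comment above concrete, here is a minimal sketch that runs the suggested pattern on that nested example string. The helper `is_valid_json` is a name introduced here purely for illustration.

```python
import json
import re

# The nested example string from the comment above.
s = ('{"uri": {"domain": "example.com", "protocol": "https"}, "foo": "bar"}'
     ' is referencing {"link": "www.aurl.com"}')

# The suggested pattern: an open brace, any run of non-'}' characters, a close brace.
matches = re.findall(r'\{[^}]*\}', s)


def is_valid_json(text):
    """Hypothetical helper: True if text parses as a complete JSON document."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


for m in matches:
    print(repr(m), is_valid_json(m))
```

The first match stops at the first close brace, so the outer object is cut off before `, "foo": "bar"}` and no longer parses. Only the flat second object survives intact.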

1 Answer


Probably the "nicest" way to do this is to let a real JSON decoder do the work rather than using horrible regexes. Treat every open brace as a possible object start point, then try to parse from there with `JSONDecoder`'s `raw_decode` method, which on success returns the parsed object and the number of characters consumed, making it possible to skip successfully parsed objects efficiently. For example:

import json

def get_all_json(teststr):
    decoder = json.JSONDecoder()
    # Find first possible JSON object start point
    sliceat = teststr.find('{')
    while sliceat != -1:
        # Slice off the non-object prefix
        teststr = teststr[sliceat:]
        try:
            # See if we can parse it as a JSON object
            obj, consumed = decoder.raw_decode(teststr)
        except Exception:
            # If we couldn't, find the next open brace to try again
            sliceat = teststr.find('{', 1)
        else:
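            # If we could, yield the parsed object, then search for the next open
            # brace after the text it was parsed from (searching, rather than
            # slicing at `consumed`, keeps trailing text such as a bare number
            # from being decoded as a value of its own)
            yield obj
            sliceat = teststr.find('{', consumed)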

This is a generator function, so you can either iterate over the objects one by one, e.g. `for obj in get_all_json(mystr):`, or, if you need them all at once for indexing, iterating multiple times, or the like, build a list with `all_objs = list(get_all_json(mystr))`.
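
As a variation on the same idea (a sketch, not the code from the answer), `raw_decode` also accepts a start index, so the scan can walk the original string without re-slicing it. The name `iter_json_objects` and its `(obj, raw_text)` return shape are assumptions made for this illustration.

```python
import json


def iter_json_objects(text, decoder=None):
    """Yield (obj, raw_text) for each JSON object found in text.

    Hypothetical helper for illustration; not part of the json module.
    """
    decoder = decoder or json.JSONDecoder()
    pos = text.find('{')
    while pos != -1:
        try:
            # With an index argument, raw_decode returns the parsed value and
            # the absolute index in `text` where the parsed document ended.
            obj, end = decoder.raw_decode(text, pos)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            # Not a valid object here; try the next open brace.
            pos = text.find('{', pos + 1)
        else:
            yield obj, text[pos:end]
            # Resume the search after the text that was just parsed.
            pos = text.find('{', end)


sample = '{"uri": "something"} is referencing {"link": "www.aurl.com"}'
results = list(iter_json_objects(sample))
for obj, raw in results:
    print(obj, raw)
```

Yielding the raw slice alongside the decoded object covers both use cases (dictionaries or original strings) in a single pass.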

ShadowRanger
  • Depending on your data, you may need to adjust the decoder configuration. For example, strict JSON forbids control characters, e.g. tabs, line feeds, etc., in strings, but sometimes real JSON has them. Passing `strict=False` to the `JSONDecoder` constructor will allow strings of that form to be parsed. Much more powerful customizations are also available, see [the `JSONDecoder` docs](https://docs.python.org/3/library/json.html#json.JSONDecoder). – ShadowRanger Sep 15 '15 at 16:24
  • If you just want the strings, rather than the decoded dictionaries, you can change `yield obj` to `yield teststr[:consumed]`. While this makes the `obj` generation a "waste", it's still much better to have a real JSON parser doing the work, rather than rolling your own, invariably bad replacement with regular expressions. – ShadowRanger Sep 15 '15 at 16:33
  • Thank you ShadowRanger for the detailed explanation. However, the string I am parsing is known not to have any nested JSON. I will try Avinash's approach first, but will definitely keep your solution in mind. :) – Jie Zhang Sep 15 '15 at 17:14
  • Please don't. Reinventing parsers unnecessarily is a cause of serious pain; [JSON is not a regular language](https://cstheory.stackexchange.com/questions/3987/is-json-a-regular-language); even though Python's regex dialect can handle more than a true regex engine, it's still nigh impossible to get it 100% correct. [Like HTML, JSON cannot be parsed with regular expressions](https://stackoverflow.com/a/1732454/364696). Example failure: `{"uri": "http://example.com/?res_id={3F2504E0-4F89-41D3-9A0C-0305E82C3301}"}` will fail even w/o nested JSON objects because the string itself has a close brace. – ShadowRanger Sep 15 '15 at 17:28
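
Two of the points from the comments above can be checked with a short sketch: the brace-inside-a-string failure, and the effect of `strict=False`. The `with_tab` sample and the variable names are made up for this illustration.

```python
import json
import re

# Example from the comment above: no nested objects, but the string value
# itself contains braces.
s = '{"uri": "http://example.com/?res_id={3F2504E0-4F89-41D3-9A0C-0305E82C3301}"}'

# The regex stops at the first close brace, which sits inside the string value.
regex_hits = re.findall(r'\{[^}]*\}', s)

strict_decoder = json.JSONDecoder()
lenient_decoder = json.JSONDecoder(strict=False)

# A real decoder understands string quoting, so it consumes the whole object.
obj, end = strict_decoder.raw_decode(s)

# Assumed sample: a JSON object whose string holds a literal tab character.
with_tab = '{"note": "a\tb"}'

try:
    strict_decoder.raw_decode(with_tab)
    strict_ok = True
except ValueError:  # strict mode rejects control characters inside strings
    strict_ok = False

lenient_obj, _ = lenient_decoder.raw_decode(with_tab)

print(regex_hits)
print(obj, end == len(s))
print(strict_ok, lenient_obj)
```

Here the regex returns a single truncated fragment missing the closing `"}`, while `raw_decode` consumes the entire string. The tab example only parses with the `strict=False` decoder.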