4

I have a string that can be one of two forms:

name multi word description {...}

or

name multi word description [...]

where {...} and [...] are any valid JSON. I am interested in parsing out just the JSON part of the string, but I'm not sure of the best way to do it (especially since I don't know which of the two forms the string will be). This is my current method:

import json

string = 'bob1: The ceo of the company {"salary": 100000}' 
o_ind = string.find('{')
a_ind = string.find('[')

if o_ind == -1 and a_ind == -1:
    print("Could not find JSON")
    exit(0)

index = min(o_ind, a_ind)
if index == -1:
    index = max(o_ind, a_ind)

data = json.loads(string[index:])  # avoid rebinding the name of the json module
print(data)

It works, but I can't help but feel like it could be done better. I thought maybe regex, but I was having trouble with it matching sub objects and arrays and not the outermost json object or array. Any suggestions?

midori
Gillespie

2 Answers

9

You can locate the start of the JSON by looking for a { or [ and then capture everything from there to the end of the string in a capturing group:

>>> import re
>>> string1 = 'bob1: The ceo of the company {"salary": 100000}'
>>> string2 = 'bob1: The ceo of the company ["10001", "10002"]'
>>> 
>>> re.search(r"\s([{\[].*?[}\]])$", string1).group(1)
'{"salary": 100000}'
>>> re.search(r"\s([{\[].*?[}\]])$", string2).group(1)
'["10001", "10002"]'

Here, the pattern \s([{\[].*?[}\]])$ breaks down as follows:

  • \s - a single whitespace character
  • the parentheses ( ) form a capturing group
  • [{\[] matches a single { or [ (the latter is escaped with a backslash)
  • .*? is a non-greedy match for any character (except a newline) any number of times
  • [}\]] matches a single } or ] (the latter needs to be escaped with a backslash)
  • $ matches the end of the string
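To go from the captured text to a Python object, the pattern can be wrapped in a small helper. This is only a sketch: the name extract_json is hypothetical, and the re.DOTALL flag is my own addition (an assumption that the JSON may span several lines), not part of the pattern above.

```python
import json
import re

# Same pattern as above. re.DOTALL is an added assumption so that "."
# also matches newlines in pretty-printed JSON.
JSON_TAIL = re.compile(r"\s([{\[].*?[}\]])$", re.DOTALL)


def extract_json(text):
    """Return the parsed JSON at the end of text, or None if there is none."""
    match = JSON_TAIL.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(extract_json('bob1: The ceo of the company {"salary": 100000}'))
print(extract_json('bob1: The ceo of the company [{"salary": 100000}]'))
```

Because the search starts at the first whitespace that is followed by { or [, a description that itself contains such a sequence (for example "uses {braces} here") would make json.loads fail, in which case this helper returns None.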

Alternatively, you can use re.split() to split the string on a whitespace character that is followed by a { or [ (using a positive lookahead) and take the last item. It works for the sample input you've provided, but I'm not sure it is reliable in general:

>>> re.split(r"\s(?=[{\[])", string1)[-1]
'{"salary": 100000}'
>>> re.split(r"\s(?=[{\[])", string2)[-1]
'["10001", "10002"]'
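One case where taking the last item goes wrong is an array of objects separated by ", ", because the space before the second { is also a split point. A possible workaround, sketched below, is to pass maxsplit=1 so that only the first split is made. The helper name json_after_first_opener is hypothetical.

```python
import json
import re

PATTERN = r"\s(?=[{\[])"


def json_after_first_opener(text):
    """Parse everything after the first whitespace that precedes { or [."""
    # maxsplit=1 splits only once, so the rest of the string stays in one piece.
    parts = re.split(PATTERN, text, maxsplit=1)
    if len(parts) < 2:
        return None  # no opening brace or bracket found
    return json.loads(parts[1])  # raises ValueError if the remainder is not valid JSON


nested = 'bob1: The ceo of the company [{"salary": 100000}, {"salary": 50000}]'
print(re.split(PATTERN, nested)[-1])    # last item only: '{"salary": 50000}]'
print(json_after_first_opener(nested))  # the whole outer array
```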
alecxe
4

You can use a simple alternation (|) in the regex to match either of the two forms:

import re
import json

def json_from_s(s):
    match = re.findall(r"{.+[:,].+}|\[.+[,:].+\]", s)
    return json.loads(match[0]) if match else None

And some tests:

print(json_from_s('bob1: The ceo of the company {"salary": 100000}'))
print(json_from_s('bob1: The ceo of the company ["salary", 100000]'))
print(json_from_s('bob1'))
print(json_from_s('{1:}'))
print(json_from_s('[,1]'))

Output (Python 3):

{'salary': 100000}
['salary', 100000]
None
None
None
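Note that the pattern above requires a : or , between the brackets, so a single-element array such as ["10001"] is not matched. As a regex-free alternative (a different technique from the one in this answer), the standard library's json.JSONDecoder.raw_decode can be tried at each { or [ until one parses. This is a minimal sketch, and the name first_json_value is hypothetical.

```python
import json


def first_json_value(text):
    """Return the first JSON object or array in text that parses, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and returns (value, index_where_parsing_stopped).
            value, _end = decoder.raw_decode(text, start)
        except ValueError:
            continue  # not valid JSON from here; try the next opener
        return value
    return None


print(first_json_value('bob1: The ceo of the company [{"salary": 100000}]'))
print(first_json_value('bob1: The ceo of the company ["10001"]'))
print(first_json_value('bob1'))
```

Since raw_decode stops at the end of the first complete value, any text after the JSON is ignored here; compare the returned end index with len(text) if trailing text should count as an error.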
midori
  • Consider this case: `'bob1: The ceo of the company [{"salary": 100000}]'`. The regex only matches the inner json object and not the outer json array – Gillespie Jan 23 '16 at 16:49
  • I only follow the ops question and explanation – midori Jan 23 '16 at 18:14
  • I am the OP, and the explanation I gave is that the string can be of the form `name multi word description [...]`. The case I gave you above follows that pattern, but the regex fails to capture it. – Gillespie Jan 23 '16 at 20:14
  • It doesn't fail to catch [...] as you could see from the tests, the one you provided in the comment above won't be caught by the accepted answer either because you didn't specify in your question that json might be inside the list – midori Jan 23 '16 at 20:17
  • If you want just catch any json in the string but not list, remove the or part in the regex – midori Jan 23 '16 at 20:22
  • Yes, the one I provided in the comment is caught correctly by the accepted answer. The accepted answer captures `[{"salary": 100000}]` whereas your answer only captures `{"salary": 100000}`, which is incorrect. – Gillespie Jan 23 '16 at 20:50
  • And lists are valid JSON, and I specified in the question that "{...} and [...] are any valid JSON". A list with an object inside it is also valid json. – Gillespie Jan 23 '16 at 20:51
  • i added small change, check now – midori Jan 23 '16 at 20:55