
I'm trying to find a way to parse the following string into a list of strings using regex.

{"first_statement" : 1, "bleh" : { "some_data" : True } }, {"second_statement" : 2}

# Group 1:
{"first_statement" : 1, "bleh" : { "some_data" : True } }

# Group 2:
{"second_statement" : 2}

I want my regex to match the outermost pair of braces, no matter how many nested braces there are inside. For instance...

{"first_statement" : 1, "bleh" : { "some_data" : True, "foo" : { "bar" : { "zing" : False } } } }

# Group 1:
{"first_statement" : 1, "bleh" : { "some_data" : True, "foo" : { "bar" : { "zing" : False } } } }

I don't have much experience with regex, but I tried a few things. The closest I got was a simple pattern, {.*?}, but it obviously ends the match at the first closing brace it encounters. All my other attempts failed; the nearest thing I found was a .NET regex solution, but I couldn't get it to work in Python.

Is there a way to do this with Python regex at all, or do I have to parse the string character by character in a simple loop? From what I found while exploring the "All Tokens" list on regex101, there is no simple way to achieve this.

Note : I don't care about the characters in between the first layer of braces. I want to ignore them.
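For reference, the character-by-character loop mentioned above could look roughly like the sketch below. It keeps a depth counter and records a group each time the depth returns to zero. The function name `split_outer_braces` is made up for illustration, and the sketch assumes that braces never appear inside quoted string values (it does not understand quoting or escaping).

```python
def split_outer_braces(text):
    """Return every outermost {...} group found in text (hypothetical helper)."""
    groups = []
    depth = 0      # current brace nesting level
    start = None   # index of the '{' that opened the current outer group
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at the top level: the outer group is complete.
                groups.append(text[start:i + 1])
    return groups


s = '{"first_statement" : 1, "bleh" : { "some_data" : True } }, {"second_statement" : 2}'
for group in split_outer_braces(s):
    print(group)
# {"first_statement" : 1, "bleh" : { "some_data" : True } }
# {"second_statement" : 2}
```

Unbalanced closing braces at the top level are simply skipped, and an unclosed group at the end of the string is dropped; whether that is the right behaviour depends on how strict the parsing should be.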

IMCoins
  • Is there any reason why you don't want to parse these as normal JSON strings? – Smuuf Apr 10 '18 at 10:20
  • @Smuuf This is actually some json, but for the first example, this is some poorly formatted json. I don't even know if I should be trying to handle such format, but I thought I'd give it a try. The idea was to try some operations on bad json input in order to handle it anyway. The correct way of formatting the first json would have been to add `[]` around the string for instance. – IMCoins Apr 10 '18 at 10:27
  • So why don't you add those braces and handle it like a normal JSON string? `json.loads('[' + s + ']')` – Graipher Apr 10 '18 at 10:27
  • This is called recursive calls and python `re` module doesn't support recursions (subroutines). You have a solution if you are able to import newer `regex` module. – revo Apr 10 '18 at 10:29
  • @Graipher It could be an interesting way of solving my original problem considering there wouldn't be any other mistakes. But anyway, putting aside my original goal which is tendentious, I am still curious about ways of solving the question. :) – IMCoins Apr 10 '18 at 10:33
  • @IMCoins: If what we're talking about here is accepting invalid JSONs as valid input into your application, *I simply wouldn't do that, if I were you*. Invalid JSONs are invalid for a reason and you shouldn't want to allow that kind of input into your app. **Doing magic stuff like that will IMHO only bring you "pain and suffering"**, for you as the developer - and probably to your clients, too, because it introduces a certain level of unpredictability to the *- otherwise pretty standardised -* system of how JSONs are supposed to be handled. – Smuuf Apr 10 '18 at 10:36
  • @Smuuf I can only agree with what you're saying. I won't integrate this into my app. This being said, for curiosity purposes, I let this question open. – IMCoins Apr 10 '18 at 10:39

2 Answers


One way without regex is to use ast.literal_eval:

from ast import literal_eval

# Adjacent string literals are joined, so the literal can span two lines.
mystr = ('{"first_statement" : 1, "bleh" : { "some_data" : True } }, '
         '{"second_statement" : 2}')

lst = list(map(str, literal_eval('['+mystr+']')))

# ["{'first_statement': 1, 'bleh': {'some_data': True}}",
#  "{'second_statement': 2}"]
jpp
  • I upvoted your answer even though I am not accepting it for the moment as I want to see if some other people want to contribute. :) – IMCoins Apr 10 '18 at 10:35

For the special case where your string is almost valid JSON and is only missing the surrounding square brackets (which seems to be nearly the case here), you can just add the brackets and try to parse it as JSON:

import json 
s = '{"first_statement" : 1, "bleh" : { "some_data" : "True" } }, {"second_statement" : 2}'
try:
    x = json.loads('[' + s + ']')
except json.JSONDecodeError:
    # do something?
    x = None
print(x)
# [{'first_statement': 1, 'bleh': {'some_data': 'True'}},
#  {'second_statement': 2}]

This is similar to adding the brackets and parsing the result with ast.literal_eval, as suggested by @jpp in his answer, but it is a bit stricter about what it accepts (the string must be valid JSON apart from the missing list brackets). Note, for example, that I needed to put quotes around True to make it valid.
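If you would rather not build a new string with the brackets added, another option is to decode one object at a time with `json.JSONDecoder.raw_decode`, which returns the parsed value together with the index where it ended. The sketch below is only an illustration: the helper name `decode_concatenated` is made up, and it assumes the objects are separated by nothing but commas and whitespace.

```python
import json


def decode_concatenated(text):
    """Decode JSON values separated by commas/whitespace (hypothetical helper)."""
    decoder = json.JSONDecoder()
    results = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip separators here.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text):
            break
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        results.append(obj)
    return results


s = '{"first_statement" : 1, "bleh" : { "some_data" : "True" } }, {"second_statement" : 2}'
print(decode_concatenated(s))
```

As with `json.loads`, each piece still has to be valid JSON, so an unquoted True raises `json.JSONDecodeError` here as well.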

Graipher