I have to parse a text file that contains different kinds of data. The most challenging part is a line that contains three different JSON objects (as strings) with other data between them. I have to separate the JSON data from the rest. The good thing is that every JSON object starts with a name. The issue I'm having with regex is isolating the first JSON object string from the others so that I can parse it with json. Here is my solution (it works), but I bet there is something better... I'm not good at regex yet.

import re

# Return the index of the '}' that closes the first JSON object in text.
# text is expected to start with '{'.
def get_json_first_end(text):
    depth = 0
    for i, v in enumerate(text):
        if v == '{':
            depth += 1
        elif v == '}':
            depth -= 1
        if depth == 0:
            return i
    return 0

# Return the string that contains the JSON object following json_name.
def get_json_str(line, json_name):
    js_str = ''
    if re.match('(.*)' + json_name + '(.*)', line):
        # Remove all spurious data before the JSON object
        data = re.sub('(.*)' + json_name, '', line)
        ind1 = data.find('{')
        ind2 = data.rfind('}')
        # ind3 is relative to the slice that starts at ind1, so add ind1 back
        ind3 = get_json_first_end(data[ind1:ind2+1])
        js_str = data[ind1:ind1+ind3+1]
    return js_str

If I don't call get_json_first_end, ind2 can be wrong when there are multiple JSON strings on the same line. get_json_str returns a string with the JSON object I want, and I can parse it with json without issues. My question is: is there a better way to do this? get_json_first_end seems quite ugly. Thanks

Update: here is an example line:

ConfigJSON ["CFG","VAR","1","[unused bit 2]","[unused bit 3]","[unused bit 4]","[unused bit 5]"] 2062195231AppTitle "Fsdn" 3737063363Bits ["RESET","QUICK","KILL","[unused bit 2]","[unused bit 3]","[unused bit 4]","[unused bit 5]"] 0837383711CRC 33628 0665393097ForceBits {"Auxiliary":[{"index":18,"name":"AUX1.INPUT"},{"index":19,"name":"AUX2.INPUT"}],"Network":[{"index":72,"name":"INPUT.1"}],"Physical":[]}

2 Answers

Your string is in a custom format. It may be possible to do this with a regex, but I tried a simple loop instead. You need to find an opening bracket, [ or {, and then the corresponding closing bracket, ] or }.

>>> string = '["CFG","VAR","1","[unused bit 2]","[unused bit 3]","[unused bit 4]","[unused bit 5]"] 2062195231AppTitle "Fsdn" 3737063363Bits ["RESET","QUICK","KILL","[unused bit 2]","[unused bit 3]","[unused bit 4]","[unused bit 5]"] 0837383711CRC 33628 0665393097ForceBits {"Auxiliary":[{"index":18,"name":"AUX1.INPUT"},{"index":19,"name":"AUX2.INPUT"}],"Network":[{"index":72,"name":"INPUT.1"}],"Physical":[]}'

>>> import json
>>> def getjson(string):
    square = ['[',']']
    curly = ['{','}']
    count = 0
    json_list = []
    character = ''
    complement_character = ''
    start = 0
    end = 0
    for i in range(len(string)):
        if not character:
            # Outside any JSON value: look for an opening [ or {.
            if string[i] == square[0]:
                character = square[0]
                complement_character = square[1]
                start = i
                count += 1
            elif string[i] == curly[0]:
                character = curly[0]
                complement_character = curly[1]
                start = i
                count += 1
        else:
            # Once [ or { has been found, find the corresponding ] or } using count.
            # Compare with ==, not is: identity of equal strings is not guaranteed.
            if string[i] == character:
                count += 1
            elif string[i] == complement_character:
                count -= 1
            if count == 0 and character:
                character = ''
                complement_character = ''
                end = i+1
                json_list.append(json.loads(string[start:end]))
    return json_list

>>> print getjson(string)
[[u'CFG', u'VAR', u'1', u'[unused bit 2]', u'[unused bit 3]', u'[unused bit 4]', u'[unused bit 5]'], [u'RESET', u'QUICK', u'KILL', u'[unused bit 2]', u'[unused bit 3]', u'[unused bit 4]', u'[unused bit 5]'], {u'Physical': [], u'Auxiliary': [{u'index': 18, u'name': u'AUX1.INPUT'}, {u'index': 19, u'name': u'AUX2.INPUT'}], u'Network': [{u'index': 72, u'name': u'INPUT.1'}]}]
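
The loop above counts every bracket it sees, including brackets that sit inside quoted strings, so it relies on those being balanced (as in "[unused bit 2]"). As a rough sketch of a variation, the scan can skip over string literals while it counts. The function name extract_json_values is hypothetical, and the sketch assumes that the text between the JSON values contains no [ or { of its own:

```python
import json

def extract_json_values(text):
    """Return every top-level JSON array or object found in text."""
    closers = {'[': ']', '{': '}'}
    values = []
    stack = []          # closing brackets that are still expected
    start = None        # index where the current top-level value began
    in_string = False   # True while inside a JSON string literal
    escaped = False     # True right after a backslash inside a string
    for i, ch in enumerate(text):
        if start is None:
            # Outside any JSON value: only an opening bracket matters.
            if ch in closers:
                start = i
                stack.append(closers[ch])
            continue
        if in_string:
            # Brackets inside a string literal are ignored.
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
            if not stack:
                # The outermost bracket just closed: decode the slice.
                values.append(json.loads(text[start:i + 1]))
                start = None
    return values
```

Called on the example line, this should give the same three values as getjson, and it also copes with a string such as "a]b" inside one of the arrays.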
Netro

We can't match arbitrary nesting levels of braces with a regular expression, but we can support a limited amount of nesting; e.g. this works for up to one level of inner braces, as in your example line:

import re

def get_json_str(line, json_name):
    m = re.search(re.escape(json_name) + r" *(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})", line)
    if m:
        return m.group(1)
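
If deeper nesting has to be supported, one possible alternative (a different technique from the regex above, not an extension of it) is to use the regex only to locate the name and let the standard library's json.JSONDecoder.raw_decode find the end of the object. A minimal sketch, where the function name get_json_obj is hypothetical and the object is assumed to follow the name after optional whitespace:

```python
import json
import re

def get_json_obj(line, json_name):
    """Return the decoded JSON object that follows json_name, or None."""
    # The lookahead leaves m.end() pointing at the opening brace.
    m = re.search(re.escape(json_name) + r"\s*(?=\{)", line)
    if m is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores any trailing text; it returns (value, end_index).
    obj, _end = json.JSONDecoder().raw_decode(line, m.end())
    return obj
```

Unlike get_json_str, this returns the parsed object rather than the string, and it raises ValueError if the text after the name is not valid JSON.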
Armali