
I have an unexpected quote in my JSON string that makes json.loads(json_str) fail.

json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''

So I'd like to use a regular expression to match and delete the quote inside the value of "content". I tried a pattern adapted from another solution:

import re
json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''
pa = re.compile(r'(:\s+"[^"]*)"(?=[^"]*",)')
pa.findall(json_str)

[out]: []

Is there any way to fix the string?

jonrsharpe
Kai Wang

2 Answers


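As noted by @jonrsharpe, you'd be far better off cleaning the source.
That said, if you do not have control over where the extra quote is coming from, you could use (*SKIP)(*FAIL) with the newer regex module and negative lookarounds. The left alternative matches every well-formed "key":"value" pair and discards it, so the right alternative only ever sees quotes that are not adjacent to a structural colon or comma: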

"[^"]+":\s*"[^"]+"[,}]\s*(*SKIP)(*FAIL)|(?<![,:])"(?![:,]\s*["}])

See a demo on regex101.com.


In Python:

import json
import regex as re  # third-party module: pip install regex

json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''

# clean the json (raw string, so \s is not treated as a string escape)
rx = re.compile(r'''"[^"]+":\s*"[^"]+"[,}]\s*(*SKIP)(*FAIL)|(?<![,:])"(?![:,]\s*["}])''')
json_str = rx.sub('', json_str)

# load it (use a name that does not shadow the json module)
data = json.loads(json_str)
print(data['id'])
# 9
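
If installing the third-party regex module is not an option, the same skip-and-discard idea can be approximated with the standard-library re module: capture the well-formed pairs in a group and put them back, so that only what the second alternative matches gets deleted. This is a minimal sketch, not a drop-in equivalent of (*SKIP)(*FAIL). The helper name strip_stray_quotes is made up for illustration, and the behaviour has only been reasoned through for the sample string from the question.

```python
import json
import re

# First alternative: a well-formed "key":"value" pair, captured so it can be kept.
# Second alternative: a quote that is not next to a structural colon or comma.
STRAY_QUOTE = re.compile(
    r'("[^"]+":\s*"[^"]+"[,}]\s*)|(?<![,:])"(?![:,]\s*["}])'
)


def strip_stray_quotes(text):
    """Hypothetical helper: keep captured pairs, delete any other match."""
    return STRAY_QUOTE.sub(lambda m: m.group(1) or '', text)


json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''

data = json.loads(strip_stray_quotes(json_str))
print(data['content'])
# abcd: efg.
```

As with the original pattern, a stray quote that happens to be followed directly by a colon or comma and another quote is left alone, so some malformed inputs will still fail to load.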
Jan

A possible solution I used:

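import json
import re

# Capture the "content" value plus the start of the next key (the trailing '","x').
pa = re.compile(r'"content":\s?"(.*?","\w)')

whole = []
with open(filename) as fin:  # filename: a file with one JSON object per line
    for eachline in fin:
        for s in pa.findall(eachline):
            s = s[:-4]  # drop the trailing '","x' so only the value remains
            s_fix = s.replace('"', '')  # remove stray quotes inside the value
            eachline = eachline.replace(s, s_fix)

        data = json.loads(eachline)
        whole.append(data)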
Kai Wang