
I'm having trouble parsing a JSON string in Python because there are unescaped double quotes inside the string values, for example {"name": "Jack O"Sullivan", "id": "1"}.

I'm trying to convert it into a list for further evaluation like so:

import ast
js = '{"name": "Jack O"Sullivan", "id": "1"}'
ast.literal_eval(js).values()  

How do I change the JSON string to something like "Jack O\'Sullivan" so that it evaluates properly?

Edit: Just to stress that I know the JSON is invalid, but this is what I've got and changing the source is NOT an option. I'm looking to work around this limitation at the moment.

sfactor
  • Where did this string come from? Might be easier to fix the source. – Daniel Roseman Oct 20 '16 at 09:25
  • `ast.literal_eval()` is not going to be able to decode this any better than `json.loads()` is, no. Unquoted quotes in strings are just as invalid in Python. – Martijn Pieters Oct 20 '16 at 09:25
  • Are you **100% certain** that the source produces unquoted quotes (and this isn't an artefact of trying to reproduce this in a Python string literal and forgetting to escape the escape)? If so, this is much, much, *much* easier fixed at the source, as it is now nigh impossible to detect which quotes are values and which ones are delimiters. – Martijn Pieters Oct 20 '16 at 09:26
  • It's completely invalid JSON, of course. – RemcoGerlich Oct 20 '16 at 09:32
  • Hi guys, yes, I know the JSON is invalid, but correcting the source is NOT an option at the moment. Hence, I'm looking to manually replace the double quotes inside the strings. These occur inside a person's name or surname, so there might be a way. – sfactor Oct 20 '16 at 09:36
  • @MartijnPieters For any arbitrary JSON, you are correct and it is much more difficult, but in OP's case it isn't impossible or even difficult to parse. – TemporalWolf Oct 20 '16 at 10:18
  • @TemporalWolf: now add commas. Or any other punctuation, including brackets and colons. This is one small example, a regex is not going to survive larger, real world data sets. – Martijn Pieters Oct 20 '16 at 10:23
  • @MartijnPieters _Perfect is the enemy of good enough_. It doesn't need to solve larger problems: It needs to work on the current corrupt dataset. No reason to over-engineer a solution when a one line regex will fix the corrupted segments. In an ideal world you'd fix the source of the corruption, regenerate the corrupted parts and then continue on. But we don't live in an ideal world. The source for the corrupted JSON is, for whatever reason, no longer available. It's recover it or lose it. In this case, it is recoverable. – TemporalWolf Oct 20 '16 at 17:53
  • @TemporalWolf: this is Stack Overflow, where we try to build a repository of knowledge. Your little solution is going to be rolled out whenever someone without a full grasp of the problem does a Google search. It is therefore important that the caveats are documented. – Martijn Pieters Oct 20 '16 at 17:57

1 Answer

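import re

# Named js rather than json so the variable does not shadow the standard json module.
js = '{"name": "Jack O"Sullivan", "id": "1"}'

# Rewrite a stray " that sits between two runs of word/space characters as \'.
fixed = re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\'\2", js)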

This should work (there is a working example at repl.it). It uses the following regex:

("[\s\w]*)"([\s\w]*")

and replaces any inner " with \'. This works as long as the inclusion list (the [\s\w]) is valid, i.e. string values contain only whitespace and word characters. You may have to add further characters to handle more complex names. Note that \' is an escape that Python string literals accept (so ast.literal_eval can read the result) but JSON does not; if the result must go through json.loads, replace the inner quote with \" instead.

It matches any string of the form "<alpha/space>"<alpha/space>" and replaces it with "<whatwasbefore>\'<whatwasafter>" using capture groups and backreferences.
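As a minimal sketch of how the two capture groups behave end to end, the snippet below applies the same pattern to the sample string from the question and parses the result. The names `raw`, `pattern`, `for_python` and `for_json` are illustrative only, and the `\"` variant is an addition for readers who want to use `json.loads` rather than `ast.literal_eval`.

```python
import ast
import json
import re

raw = '{"name": "Jack O"Sullivan", "id": "1"}'

# Group 1: the opening quote plus the text before the stray quote.
# Group 2: the text after the stray quote plus the closing quote.
pattern = re.compile(r'("[\s\w]*)"([\s\w]*")')

# Variant from the answer: \' is a valid escape in a Python string
# literal, so ast.literal_eval can read the repaired text.
for_python = pattern.sub(r"\1\'\2", raw)
record = ast.literal_eval(for_python)

# Variant for json.loads: JSON only accepts \" for an embedded quote.
for_json = pattern.sub(r'\1\\"\2', raw)
record_json = json.loads(for_json)

print(list(record.values()))
print(record_json["name"])
```

The first variant yields a dict whose name contains an apostrophe, while the second keeps the original double quote inside the name. Which one is appropriate depends on what the data is supposed to mean, and that cannot be recovered from the corrupted text alone.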

See the example at regex101

As I mentioned in the comments, the alternative is to use an exclusion list instead: match anything that is not a JSON structural character, [^{}:,]. This should produce similar results, but won't miss names that contain other characters (like -, for example).
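The exclusion-list idea can be sketched as follows. The sample record is made up for illustration, and the character class here also excludes `"` itself, which is a small variation on the `[^{}:,]` set above so that each group cannot run across another quote. This is a best-effort repair under those assumptions, not a general parser for broken JSON.

```python
import json
import re

# Hypothetical sample containing characters that [\s\w] would not cover.
raw = '{"name": "Anne-Marie O"Neil & Sons", "id": "7"}'

# Either side of the stray quote may contain anything except JSON
# structural characters and the double quote itself.
pattern = re.compile(r'("[^{}:,"]*)"([^{}:,"]*")')

# Escape the inner quote as \" so the result is valid JSON.
fixed = pattern.sub(r'\1\\"\2', raw)
record = json.loads(fixed)

print(record["name"])
```

Values that themselves contain a comma, colon or brace still defeat this pattern, as do values with more than one stray quote, so it is worth checking the repaired records against whatever is known about the data.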

TemporalWolf
  • Yes, this is what I needed. I had to make a few more changes to cover cases where there was an & character in the name, but this worked for me. Thanks! – sfactor Oct 20 '16 at 09:55
  • @sfactor The alternative is to make an exclusion group of things in the json structure: `[^{},:]` would probably work. – TemporalWolf Oct 20 '16 at 09:59
  • that may make it more generalizable i guess. any examples of how that would work? – sfactor Oct 20 '16 at 10:01
  • @sfactor it works exactly the same way: `[^]` is a negated character set: it accepts anything that is **not** in the set. so `[^abc]` will accept any character that is not `a`, `b`, or `c`. – TemporalWolf Oct 20 '16 at 10:04