-1

Imagine these samples:

["And he said "Hello world" and went away"]

Or:

{"key": "value " with quote in the middle"}

Or

["invalid " string", "valid string"]

Those are invalid json and it is pretty obvious how to fix those by escaping quotes there. I am getting those JSON from bugged system (which I don't control and cannot change) so I have to fix it on my side. In theory it should be pretty simple - if you are withing string and you find quote character, it should be immediately followed by either:

  • comma and another quote ,"
  • end of array if in array ]
  • end of object if in object }

In all other cases, the quote can be considered part of the string and quoted.

Before I start implementing this.

  • Are there any libraries that are handling this?
  • Any json parser libraries that can be easily customized to do this?
  • Are there any errors in my logic?
K.H.
  • 1,383
  • 13
  • 33
  • It's not generally possible to do this reliably. You should fix the source so it produces valid JSON, not spend lots of time trying to deal with it. – Barmar Feb 21 '23 at 20:51
  • Sure, I wish I could, but the source is not under my control. – K.H. Feb 21 '23 at 20:52
  • 1
    What ever you try can have false positives if the data also contains embedded commas, brackets, or braces next to the quotes that are not escaped. – Barmar Feb 21 '23 at 20:53
  • 1
    I don't think there is a general library for this - every buggy JSON implementation produces buggy JSON in its own idiosyncratic way. But maybe you could fix this up using regexes - you could search for quotes not followed by the three things you note using a [negative lookahead](https://stackoverflow.com/questions/31201690/find-word-not-followed-by-a-certain-character), and replace them with escaped quotes. – Nick ODell Feb 21 '23 at 20:53
  • E.g. ```["And he said "Hello world"] and went away"]``` – Barmar Feb 21 '23 at 20:54
  • I am aware this solution will never be errorprone and for example it will fail on nested "json in json", but that is not my case. Problems are really as simple as in the example so the fix should work. – K.H. Feb 21 '23 at 20:55
  • 3
    You need to tell whoever assigned this that it's not possible to do it reliably. They should push back on the data supplier and tell them to provide proper data. – Barmar Feb 21 '23 at 20:55
  • 1
    OK, then use a regexp as suggested above. If you can't get it working to your satisfaction, come back with your code and we'll help you fix it. – Barmar Feb 21 '23 at 20:56
  • 2
    One other thing to consider: When you build heuristics that work correctly only with data malformed in ways you expect, that leaves an opening for attackers deliberately generating data malformed in ways you _don't_ expect. Someone can instead of passing `value " with quote in the middle` pass `value", "otherkey": "other stuff here`, and thereby inject arbitrary keys, which means that you can't trust values to have originally been associated with the key that your post-processing file _claims_ that value came from. – Charles Duffy Feb 21 '23 at 21:06
  • 1
    This might not matter for the contents of the data file and how you're processing it _today_, but that kind of complexity lives in the system for as long as the system survives; compromise now, and folks maintaining the system in a decade -- with whatever extensions and new functionality has been added -- are still subject to that compromise. – Charles Duffy Feb 21 '23 at 21:07
  • I appreciate all the comments, but please leave out the "fix source". I asked here to examine this specific way while being aware of it's risks so I want to focus on this solution even if just in theory. "don't do it" comments are not constructive in this sense :) – K.H. Feb 21 '23 at 21:12
  • As long as you're pushing back on the folks providing the data at least as hard as we're pushing back on you, I can't ask for anything more; and I do wish you good fortune in getting your practical problem solved. :) – Charles Duffy Feb 21 '23 at 21:19
  • Thanks! I am, but there's no point talking about it here. – K.H. Feb 21 '23 at 21:23

1 Answers1

0

In the end I came up with this. Regex are not enough since I want to hold context of what's currently happening in json to be a bit more restrictive. Could be improved by using the whole stack and to knowing, what is current array/object nesting, but this is simple enough to work for my usecase.

And yes since this is only modifying the string, I can leave it out if I get the original source fixed.

import json
import re
import unittest
from json import JSONDecodeError
from typing import Any

expected_characters_by_prestring_value = {
    "[": (",", "]"),
    "]": ("[", ","),
    "{": (":",),
    "}": (",", "{", "]"),
    ":": (",", "}"),
}


def fix_unescaped_quotes(raw: str) -> str:
    in_string = False
    output = ""
    nesting_stack = []
    for index, character in enumerate(raw):
        if character == '"' and raw[index - 1] != "\\":
            if in_string:
                first_nonwhite_character_ahead = re.search(
                    r"\S", raw[index + 1:]
                ).group()
                if first_nonwhite_character_ahead in expected_characters_by_prestring_value[
                    nesting_stack[-1]]:  # (",", "]", "}", ":"):
                    in_string = False
                else:
                    output += "\\"
            else:
                in_string = True
        else:
            if not in_string:
                if character.strip() != "" and character not in (",",):
                    nesting_stack.append(character)
        output += character
    return output


def parse_and_fix(raw: str) -> Any:
    try:
        return json.loads(raw)
    except JSONDecodeError:
        return json.loads(fix_unescaped_quotes(raw=raw))


class JsonFixUnescapedQuotesTest(unittest.TestCase):
    def test_completely_invalid(self):
        with self.assertRaises(JSONDecodeError):
            parse_and_fix("invalid_json")

    def test_valid(self):
        self.assertEqual({}, parse_and_fix("{}"))

    def test_invalid_single_array(self):
        self.assertEqual(
            ['he said "hello world" and left'],
            parse_and_fix("""["he said "hello world" and left"]"""),
        )

    def test_invalid_object(self):
        self.assertEqual(
            {"key": 'value " with quote in the middle'},
            parse_and_fix("""{"key": "value " with quote in the middle"}"""),
        )

    def test_invalid_2_item_array(self):
        self.assertEqual(
            ['invalid " string', "valid string"],
            parse_and_fix("""["invalid " string", "valid string"]"""),
        )

    def test_wont_get_fooled_by_colon(self):
        self.assertEqual(
            ['invalid ": string', "valid string"],
            parse_and_fix("""["invalid ": string", "valid string"]"""),
        )

    def test_wont_get_fooled_by_colon_after_object(self):
        self.assertEqual(
            {"key": "value\":"},
            parse_and_fix("""{"key": "value":"}"""),
        )

    def test_wont_get_fooled_by_comma_in_key(self):
        self.assertEqual(
            {"key\",": "value"},
            parse_and_fix("""{"key",": "value"}"""),
        )
K.H.
  • 1,383
  • 13
  • 33