There are multiple questions about deserializing JSON that contains embedded quotes, but I don't see a Python-specific solution to this:

Given log data that is only partially valid JSON, e.g.:

"{"link":"<a href="mylink">http://my.com</a>"}"

The inner quotes, e.g. those around "mylink", collide with the outer quotes that delimit the individual keys and values.

Left unescaped, these make both json.loads and ast.literal_eval (see here) fail with a parse error (json.JSONDecodeError and SyntaxError, respectively).
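A minimal reproduction of the failure, using the sample line above as-is. The helper name `try_parse` is made up for this illustration:

```python
import ast
import json

# The sample log line from above, including its outer quotes.
raw = '"{"link":"<a href="mylink">http://my.com</a>"}"'


def try_parse(parser, text):
    """Return the parsed value, or the name of the exception that was raised."""
    try:
        return parser(text)
    except (ValueError, SyntaxError) as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        return type(exc).__name__


print(try_parse(json.loads, raw))
print(try_parse(ast.literal_eval, raw))
```

Neither call returns a parsed value; each one reports the exception raised by its parser.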

On the other hand, hunting down and escaping the inner quotes via regex is tricky: the nested JSON structure varies (the above is just a minimal example), and the keys and values are open-ended, with no schema available.
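To make the difficulty concrete, here is a rough sketch of the kind of quote-escaping heuristic I mean. It rests on an assumption that does not hold in general: a double quote is structural only if the nearest non-space character before it is one of `{ [ , :` or the nearest one after it is one of `: , } ]`. The names `escape_inner_quotes` and `loads_lenient` are hypothetical, and the sample is used without its outermost quotes:

```python
import json

OPENERS = set('{[,:')  # a structural quote may directly follow one of these
CLOSERS = set(':,}]')  # ...or directly precede one of these


def escape_inner_quotes(text: str) -> str:
    """Backslash-escape quotes that do not look structural (heuristic only)."""
    out = []
    for i, ch in enumerate(text):
        # Skip quotes that are already escaped.
        if ch == '"' and text[i - 1:i] != '\\':
            before = text[:i].rstrip()[-1:]    # nearest non-space char on the left
            after = text[i + 1:].lstrip()[:1]  # nearest non-space char on the right
            if before not in OPENERS and after not in CLOSERS:
                out.append('\\')
        out.append(ch)
    return ''.join(out)


def loads_lenient(text: str):
    """Try strict JSON first, then retry once after the heuristic repair."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(escape_inner_quotes(text))


sample = '{"link":"<a href="mylink">http://my.com</a>"}'
print(loads_lenient(sample))
```

This handles the minimal example, but it misclassifies any inner quote that happens to sit next to a comma, colon, brace, or bracket (for instance a value like `x "y", z`), which is exactly the kind of variability these logs contain.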

Any alternatives?

alancalvitti
  • It is impossible to solve this in the general case. However, if you *only* care about this specific case you can probably solve it with a regex. The right solution is to fix your logging to escape quotes correctly. – Daniel Pryden Apr 23 '19 at 17:12
  • I second the first comment under the assumption that you do in fact have control over how this is structured. Otherwise, using regex, or some very "interesting" parsing might fit your need. – idjaw Apr 23 '19 at 17:15
  • "is only partially valid JSON" i.e. *it is not JSON*. So one way or another, you will have to figure out a way to parse your bespoke format. Or, probably a better idea, don't rely on a bespoke format and stick to JSON. – juanpa.arrivillaga Apr 23 '19 at 17:44
  • @juanpa.arrivillaga, correct, it is not JSON. But don't throw out the baby with the bathwater. We have a combination of nested JSON, non-JSON, etc., from which we extract valid JSON at various levels with regex. However, these specific logs contain much more variability, so the regex will be more complex. – alancalvitti Apr 23 '19 at 17:54
  • There is no baby. You have a muddled mess, and the best solution *would be to stick to a well supported format*. Otherwise, you are *forced* to parse it yourself somehow, whether hackily through regex or through implementing a full-fledged parser (check out [pyparsing](https://github.com/pyparsing/pyparsing)). Without additional details, there is not much better one can say. Something cannot be "nested JSON with non-JSON". Again **that just means not JSON**. – juanpa.arrivillaga Apr 23 '19 at 17:59
  • @juanpa.arrivillaga, perhaps in fantasyland you have that option, but the logs here are pre-existing. Moreover, for one platform, we can extract JSON from the muddle at a ~95% rate, but on another platform the rate is much lower for the given reason. So it depends: sometimes there is a baby. Thanks for the link, will check. – alancalvitti Apr 23 '19 at 18:07
  • @alancalvitti: Please don't use the pejorative "fantasyland" here. Lots of people *do* manage to write logs in a format that is syntactically consistent and parseable. (Others don't, and yes I have seen lots of crazy things.) But juanpa is correct: what you have *is not JSON*, it's a bespoke textual output. Don't try to parse it as JSON, simply *parse* it as what it is. And be prepared for it to be complicated -- for example, I suspect the formal grammar of this mess will not be LL(1)-parseable like JSON is. – Daniel Pryden Apr 23 '19 at 18:36
  • @DanielPryden, please suggest advice that conforms with reality: we don't have the option of modifying pre-existing logs. This is for a very popular and complex app; I don't get to write the logs, just analyze them. And please see my previous comment as to the success rate in using `json.loads` in conjunction with some regex for one platform's log format. I stand by my comments. – alancalvitti Apr 24 '19 at 14:19
  • @alancalvitti: I'm not going to argue with you. The app in question may be popular and complex, but that doesn't make it correct. Yes, you have messy data. Yes, you will need to figure out a way to solve this. Your question as stated is about how to parse this data as JSON. **It cannot be parsed as JSON.** You could *parse* the text and recover data from it, using a lexing and parsing tool like pyparsing, or you could *convert* this text into JSON, by parsing it. Your acerbic responses to attempts to help you aren't making me want to help you any more, and in fact I won't respond any more. – Daniel Pryden Apr 24 '19 at 15:53
  • @DanielPryden, no need to reply, but in my opinion your stance is hypocritical, as e.g. the [most frequent JSON related question](https://stackoverflow.com/questions/2835559/why-cant-python-parse-this-json-data) starts off with "I have this JSON". – alancalvitti Apr 24 '19 at 16:28

0 Answers