5

I am trying to parse JSON input as string in Python, not able to parse as list or dict since the JSON input is not in a proper format (Due to limitations in the middleware can't do much here.)

{
  "Records": "{Output=[{_fields=[{Entity=ABC  , No=12345, LineNo=       1, EffDate=20200630}, {Entity=ABC  , No=567, LineNo=       1, EffDate=20200630}]}"
}

I tried json.loads and ast.literal (invalid syntax error).

How can I load this?

T.J. Crowder
  • 1,031,962
  • 187
  • 1,923
  • 1,875
Ankit Singh
  • 57
  • 1
  • 3
  • If you're talking about what's in the value of `Records`, that's just not JSON at all. It's not just a matter of missing quotes (though that's a problem), it also uses `=` instead of `:`. It's just a different data format entirely. – T.J. Crowder Aug 06 '21 at 13:19
  • What you have is valid JSON, but the value of the `Records` key is not (nor is it a proper Python dict display), and it's not even particularly close. You'll need to find some other parser, or convince the producer to make it valid JSON. – chepner Aug 06 '21 at 13:20
  • Can you elaborate a little bit more on your question? Paste a code sample? The string you put in your question does appear to be in proper format, with one JSON item with the "Records" key and the value being the string contained within the double quotes following the key. My best bet is that perhaps you put some sort of escape character in the string. Please paste the code that returns the syntax error. – czyngis Aug 06 '21 at 13:20
  • 1
    @czyngis I suspect the OP is trying to do something like `json.loads(result["Records"])`, which is just doomed to failure. – chepner Aug 06 '21 at 13:22
  • @T.J.Crowder yeah that is another issue but I was thinking of replacing = with : since it is a string, but the quotes is the major problem – Ankit Singh Aug 06 '21 at 13:27
  • You're gonna have to write your own parser – Petr L. Aug 06 '21 at 13:33

4 Answers4

3

The sad answer is: the contents of your "Records" field are simply not JSON. No amount of ad-hoc patching (= to :, adding quotes) will change that. You have to find out the language/format specification for what the producing system emits and write/find a proper parser for that particular format.

As a clutch, and only in the case that the above example already captures all the variability you might see in production data, a much simpler approach based on regular expressions (see package re or edd's pragmatic answer) might be sufficient.

ojdo
  • 8,280
  • 5
  • 37
  • 60
  • Why, the patching will make it json – Petr L. Aug 06 '21 at 13:42
  • 2
    Only if we know completely how this data format would handle more advanced scenarios, like string/text data in one of its field. The simple solutions proposed right now all would fall on the nose if `Entity` could emit a multi-word string. Or maybe a quoted string if it contains a comma, bracket, curly bracket, ... We simply cannot guess working code from a single example. – ojdo Aug 06 '21 at 13:45
3

If the producer of the data is consistent, you can start with something like the following, that aims to bridge the JSON gap.

import re
import json


source = {
  "Records": "{Output=[{_fields=[{Entity=ABC  , No=12345, LineNo=       1, EffDate=20200630}, {Entity=ABC  , No=567, LineNo=       1, EffDate=20200630}]}"
}

s = source["Records"]

# We'll start by removing any extraneous white spaces
s2 = re.sub('\s', '', s)

# Surrounding any word with "
s3 = re.sub('(\w+)', '"\g<1>"', s2)

# Replacing = with :
s4 = re.sub('=', ':', s3)

# Lastly, fixing missing closing ], }
## Note that }} is an escaped } for f-string.
s5 = f"{s4}]}}"

>>> json.loads(s5)
{'Output': [{'_fields': [{'Entity': 'ABC', 'No': '12345', 'LineNo': '1', 'EffDate': '20200630'}, {'Entity': 'ABC', 'No': '567', 'LineNo': '1', 'EffDate': '20200630'}]}]}

Follow up with some robust testing and have a nice polished ETL with your favorite tooling.

edd
  • 1,307
  • 10
  • 10
  • thanks a ton for your answer, this is exactly what I was looking for, since the input data will always be consistent I can go ahead with this approach and yes only after robust testing. – Ankit Singh Aug 06 '21 at 17:40
0

As i understand you are trying to parse the value of the Records item in the dictionary as JSON, unfortunately you cannot.

The string in that value is not JSON, and you must write a parser that will first parse the string into a JSON string according to the format that the string is written in by yourself. ( We don't know what "middleware" you are talking of unfortunately ).

tldr: Parse it into a JSON string, then parse the JSON into a python dictionary. Read this to find out more about JSON ( Javascript Object Notation ) rules.

czyngis
  • 413
  • 2
  • 5
  • 18
  • 1
    Why parse it "into a JSON string", i.e. the detour, not directly into a Python data structure? – ojdo Aug 06 '21 at 13:34
  • @ojdo OP is saying in the title of the question that he wants to parse JSON. He wants to parse JSON so i have replied how to parse it as JSON. Also, i think it's easier to write a parser which just manipulates the string and then use the json module to load it. – czyngis Aug 06 '21 at 13:37
0

There you go. This code will make it valid json:

notjson = """{
  "Records": "{Output=[{_fields=[{Entity=ABC  , No=12345, LineNo=       1, EffDate=20200630}, {Entity=ABC  , No=567, LineNo=       1, EffDate=20200630}]}"
}"""
notjson = notjson.replace("=","':") #adds a singlequote and makes it more valid
notjson = notjson.replace("{","{'")
notjson = notjson.replace(", ",", '")
notjson = notjson.replace("}, '{","}, {")
json = "{" + notjson[2:]
print(json)
print(notjson)
Petr L.
  • 414
  • 5
  • 13