0

I have a JSON file of the following format:

/* 1 */
{
    "input" : {
        "text" : "car"
    },
    "output" : {
        "text" : [ 
            "Sorry! I didn't understand. Please rephrase."
        ],
    },
        "Auth" : "You are authenticated",
        "Begin" : "Started",
        "ID" : "ABC"
    },
    "timestamp" : ISODate("2017-02-20T12:36:40.226Z"),
}
/* 2 */
{
    "input" : {
        "text" : "bat"
    },
    "output" : {
        "generic" : [ 
            {
                "response_type" : "text",
                "text" : "I understood that!"
            }
        ],
        "text" : [ 
            "I understood that!"
        ],
        "log_messages" : []
    },
        "Auth" : "You are authenticated",
        "Begin" : "Started",
        "ID" : "CDE"
    },
    "timestamp" : ISODate("2016-02-20T12:36:40.226Z")
}

I would like to create a dataframe of the following format but I cannot figure out how to do it (I am not sure if you would have to clean the json file to not include comments):

ID     Date           Input["text"]    Output["text"]
ABC    2017-02-20     car              Sorry! I didn't understand. Please rephrase.
CDE    2016-02-20     bat              I understood that!

Can anyone help?

helloWorld
  • 11
  • 1
  • Why are you storing an object in a JSON file? The object should be parsed into JSON format first. – Spencer Pollock Mar 05 '19 at 18:10
  • 2
    That's not a valid JSON file. Comments aren't allowed, and the file should be one object or array. If you want to store multiple objects, they should be wrapped in an array. – Barmar Mar 05 '19 at 18:11
  • You also have an extra `,` at the end of an object, `ABC` and `CDE` need to be in quotes, and you have mismatching curly braces. – Barmar Mar 05 '19 at 18:14
  • `ISODate("2016-02-20T12:36:40.226Z")` is not valid JSON -- JSON doesn't have function calls. – Barmar Mar 05 '19 at 18:15
  • If you need to process this, you'll need to write a custom parser. It has so many problems that there's no way you can use a standard JSON parser. – Barmar Mar 05 '19 at 18:16
  • @Barmar There might be some minor mistakes in copying and pasting - but essentially this is coming from a mongoDB database (so 'assume' formatting and everything else is correct). And unfortunately I have a file which has comments (the output extract does come with comment), and I need to find a way to get rid of the comments using python. – helloWorld Mar 05 '19 at 18:26

1 Answers1

1

As some of the comments have noted, that is not a valid json. If you can get the formatting right, then pandas has pd.read_json which you can use directly as a dataframe constructor. This is the right thing to do. You can build a custom parser, but it's way less efficient than having your data distributed in a standardized format that pandas can immediately parse. If this is coming from MongoDB like you mentioned, Mongo has a built in json utility for parsing its data, you may be able to get the right output using its shell features. The link below has some details on that. https://docs.mongodb.com/manual/reference/mongodb-extended-json/

If you cannot get the right format, you'll need to have pandas parse the json as a csv and pass the read_csv sep parameter a regex that pattern matches the format of your document. The first link shows an example of doing that. The second link is to a tutorial for parsing text with regular expressions, it'll help you generate a pattern. The third link is the w3schools tutorial on regexes which has a good reference section on regex sequences and usage. You'll need to find a way to eliminate the ISODate calls and then you can use pandas to ensure that column is in a proper datetime format, but parsing this will require treating those elements as strings and not functions. Remember, I think it's best you just get mongo to spit out an actual json without the mess, but if you cannot conceivably do that, defining a custom parser using regex inside the pd.read_csv function is the way to go.

Create pandas dataframe from json objects

https://www.vipinajayakumar.com/parsing-text-with-python/

https://www.w3schools.com/python/python_regex.asp#matchobject

Quick note: naming pandas dataframe columns that way, i.e. "input['type']", shadows the call for a pandas series. If I parsed your file into a data frame and then wanted to call an input column, I would need to write df['input["text"]'] which is not pythonic. It also makes a mess of your data because it means you are storing attributes of the column's rows inside the index, meaning that you would need a separate column for each category of input or you only have one category of input and its unnecessarily confusing. You'd be better off having a column that denoted whether or not the row's observation was text or another type, or better yet ensure that your input types match an established data type that can be stored as metadata, but i digress.

Sam
  • 91
  • 7