-1

I need to convert a complex json file to csv using python, I tried a lot of codes without success, I came here for help,I updated the question, the JSON file is about a million,I need to convert them to csv format

csv file

{
    "_id": {
        "$oid": "2e3230"
    },
    "add": {
        "address1": {
            "address": "kvartira 14",
            "zipcode": "10005",
        },
        "name": "Evgiya Kovava",
        "address2": {
            "country": "US",
            "country_name": "NY",
        }
    }
}
{
    "_id": {
        "$oid": "2d118c8bo"
    },
    "add": {
        "address1": {
            "address": "kvartira 14",
            "zipcode": "52805",
        },
        "name": "Eiya tceva",
        "address2": {
            "country": "US",
            "country_name": "TX",
        }
    }
}

lex9527
  • 5
  • 3
  • What did you try? Post your code – Barry the Platipus Jul 17 '22 at 13:42
  • How is the csv supposed to look like? If we suppose you need a dataframe structure, what are the columns, what the rows? – Rockbar Jul 17 '22 at 13:44
  • `cvs` keeps all as simple `rows` and `columns` but you have nested elements - you have to show what result you expect. Should it flatten it or it should keep as nested elements (dictionary inside cell)? – furas Jul 17 '22 at 14:59
  • if you tried some methods then you should show it in question (not in comments) - this way we could see what wrong you did, and we could suggest how to fix it. – furas Jul 17 '22 at 15:00
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Jul 17 '22 at 17:39

1 Answers1

1
import pandas as pd

null = 'null'

data = {
    "_id": {
        "$oid": "2e3230s314i5dc07e118c8bo"
    },
    "add": {
        "address": {
            "address_type": "Door",
            "address": "kvartira 14",
            "city": "new york",
            "region": null,
            "zipcode": "10005",
        },
        "name": "Evgeniya Kovantceva",
        "type": "Private person",
        "code": null,
        "additional_phone_nums": null,
        "email": null,
        "notifications": [],
        "address": {
            "address": "kvartira 14",
            "city": "new york",
            "region": null,
            "zipcode": "10005",
            "country": "US",
            "country_name": "NY",
        }
    }
}

df = pd.json_normalize(data)
df.to_csv('yourpath.csv')

Beware the null value. The "address" nested dictionary comes inside "add" two times almost identical?

EDIT

Ok after your information it looks like json.JSONDecoder() is what you need.

Originally posted by @pschill on this link: how to analyze json objects that are NOT separated by comma (preferably in Python)

I tried his code on your data:

import json 
import pandas as pd

data = """{
    "_id": {
        "$oid": "2e3230"
    },
    "add": {
        "address1": {
            "address": "kvartira 14",
            "zipcode": "10005"
        },
        "name": "Evgiya Kovava",
        "address2": {
            "country": "US",
            "country_name": "NY"
        }
    }
}
{
    "_id": {
        "$oid": "2d118c8bo"
    },
    "add": {
        "address1": {
            "address": "kvartira 14",
            "zipcode": "52805"
        },
        "name": "Eiya tceva",
        "address2": {
            "country": "US",
            "country_name": "TX"
        }
    }
}"""

Keep in mind that your data also has trailing commas which makes the data unreadable (the last commas right before every closing bracket).

You have to remove them with some regex or another approach I am not familiar with. For the purpose of this answer I removed them manually.

So after that I tried this:

content = data
parsed_values = []
decoder = json.JSONDecoder()
while content:
    value, new_start = decoder.raw_decode(content)
    content = content[new_start:].strip()
    # You can handle the value directly in this loop:
    # print("Parsed:", value)
    # Or you can store it in a container and use it later:
    parsed_values.append(value)

which gave me an error but the list seems to get populated with all the values:

parsed_values
[{'_id': {'$oid': '2e3230'},
  'add': {'address1': {'address': 'kvartira 14', 'zipcode': '10005'},
   'name': 'Evgiya Kovava',
   'address2': {'country': 'US', 'country_name': 'NY'}}},
 {'_id': {'$oid': '2d118c8bo'},
  'add': {'address1': {'address': 'kvartira 14', 'zipcode': '52805'},
   'name': 'Eiya tceva',
   'address2': {'country': 'US', 'country_name': 'TX'}}}]

next I did:

df = pd.json_normalize(parsed_values)

which worked fine. You can always save that to a csv with:

df.to_csv('yourpath.csv')

Tell me if that helped.

Your json is quite problematic after all. Duplicate keys (problem), null value (unreadable), trailing commas (unreadable), not comma separated dicts... It didn't catch the eye at first :P

diml
  • 136
  • 4
  • The code works fine, upvoted for you – lex9527 Jul 17 '22 at 18:08
  • However, it errors when doing a lot of conversions, how can I load the JSON file as well as flatten it – lex9527 Jul 17 '22 at 18:13
  • My answer was for the specific json. What do u mean a lot of conversions? Other jsons? With other structure-data? What errors do u get? You understand that my answer was based only on the data given. You could provide more information I guess – diml Jul 17 '22 at 18:19
  • I mean how to open json file using "open" or "read_json", my json file has about a million entries – lex9527 Jul 17 '22 at 18:48
  • I updated the json sample, you can see it in the question – lex9527 Jul 17 '22 at 18:49
  • You provided two dictionaries (somehow different from the original post-they dont have the duplicate address key etc) but something is missing again. Are they inside a list? [ {dict1}, {dict2} ] inside a string? ' {dict1}, {dict2} ' or inside a dictionary? { {dict1}, {dict2} } – diml Jul 17 '22 at 19:18
  • form:{dict1}{dict2}{dict3}{dict4} The json sample is the same as the first time,but,The json file was too long to post, so I removed some unnecessary information, – lex9527 Jul 17 '22 at 20:06
  • If for instance these two objects where inside a list separated by a comma [ {obj1}, {obj2} ] , pd.json_normalize() would suite you well. What is the first and the last character of your json and what seperates the objects? I still feel you don't give all the data needed...alse keep in mind that in the original post there is a duplicate key (address) which gets overriden by the last so you eventually lose "address_type" information unless you think of something else. – diml Jul 17 '22 at 20:08
  • nice, you are very clear, I have solved the problem, – lex9527 Jul 18 '22 at 06:04
  • Good. I hope you saw my edit too. Good luck – diml Jul 18 '22 at 08:47
  • Yes it works, but very slow, about tens of KB a minute, sad – lex9527 Jul 18 '22 at 10:31