I have 70k files, all of which look similar to this:

{'id': 24, 'name': None, 'city': 'City', 'region_id': 19,
 'story_id': 1, 'description': 'text', 'uik': None, 'ustatus': 'status',
 'wuiki_tik_name': '', 'reaction': None, 'reaction_official': '',
 'created_at': '2011-09-07T07:24:44.420Z', 'lat': 54.7, 'lng': 20.5,
 'regions': {'id': 19, 'name': 'name'},
 'stories': {'id': 1, 'name': '2011-12-04'},
 'assets': [],
 'taggings': [{'tags': {'id': 6, 'name': 'name', 'tag_groups': {'id': 3, 'name': 'Violation'}}},
              {'tags': {'id': 8, 'name': 'name', 'tag_groups': {'id': 5, 'name': 'resource'}}},
              {'tags': {'id': 1, 'name': '01. Federal', 'tag_groups': {'id': 1, 'name': 'Level'}}},
              {'tags': {'id': 3, 'name': '03. Local', 'tag_groups': {'id': 1, 'name': 'stuff'}}},
              {'tags': {'id': 2, 'name': '02. Regional', 'tag_groups': {'id': 1, 'name': 'Level'}}}],
 'message_id': None,
 '_count': {'assets': 0, 'other_messages': 0, 'similars': 0, 'taggings': 5}}

The ultimate goal is to export everything into a single CSV file. That can be done successfully without flattening, but since the data has a lot of nested values I would like to flatten it, and this is where I started facing problems related to data types. Here's the code:

import json
from pandas import json_normalize
import glob

path = glob.glob("all_messages/*.json")
for file in path:
    with open(file, "r") as filer:
        content = json.loads(json.dumps(filer.read()))  # round-trips the raw text: content is still a str
        if content != 404:
            df_main = json_normalize(content)
            df_regions = json_normalize(content, record_path=['regions'], record_prefix='regions.', meta=['id'])
            df_stories = json_normalize(content, record_path=['stories'], record_prefix='stories.', meta=['id'])
            #... More code related to normalization

df_out.to_csv('combined_json.csv')

This code occasionally throws AttributeError: 'str' object has no attribute 'values' or ValueError: DataFrame constructor not properly called!. I realise that this is caused by json.dumps() producing a JSON string rather than a dict, but I have failed to turn it into anything usable.
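
As far as I can tell, json.dumps() just wraps the raw text in a JSON string, and json.loads() hands that same str straight back:

import json

raw = "{'id': 24, 'name': None}"  # what filer.read() returns: a dict literal, not JSON
wrapped = json.dumps(raw)         # a JSON string containing that text, single quotes intact
back = json.loads(wrapped)        # the original str again, never a dict
print(type(back))                 # <class 'str'>, which json_normalize cannot handle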

Any possible solutions to this?

Octner
  • Does this answer your question? [Convert a String representation of a Dictionary to a dictionary](https://stackoverflow.com/questions/988228/convert-a-string-representation-of-a-dictionary-to-a-dictionary) – buran Oct 27 '22 at 06:13
  • Convert the file content to a dict; then you can convert that to JSON or construct the DF another way – buran Oct 27 '22 at 06:14
  • @buran thanks for the suggestion. JSON does not accept single quotation marks and requires double ones. That is why I am trying to call json.dumps(), which, however, gives me JSON strings and appears to be hostile to pandas. – Octner Oct 27 '22 at 06:21
  • The content of the file is not JSON, but a dict literal. (How you ended up with these files being dict literals with single quotes instead of JSON is another question.) That is why I suggested reading the file and converting its str content to a dict object. Then you can dump that dict to a valid JSON string and use it with json_normalize, or you can work directly with the dict to construct the DataFrame, whichever is preferred (see the sketch after these comments) – buran Oct 27 '22 at 07:09
  • Add a try-except around the parsing and, if it fails, append the offending file to a list. You may then need to write a separate condition based on how the data in the failing files differs from the others. – Bushmaster Oct 27 '22 at 07:22
  • @buran I ended up extracting columns from the CSV file that I originally generated, including the ones that needed normalisation. Then I took each column that still contained dict literals, replaced the single quotation marks using replace, and normalised them one by one depending on the level of nesting. I kept encountering errors working with the files directly, so that's the workaround that eventually worked. Thanks for your explanation – Octner Oct 28 '22 at 07:16
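
A minimal sketch of the approach buran describes, assuming each file holds a single Python dict literal (ast.literal_eval parses those directly, so no JSON round-trip is needed):

import ast
import glob

import pandas as pd

frames = []
for file in glob.glob("all_messages/*.json"):
    with open(file, "r") as filer:
        # The files contain dict literals, not JSON, so parse them with
        # ast.literal_eval instead of json.loads.
        content = ast.literal_eval(filer.read())
    frames.append(pd.json_normalize(content))

df_out = pd.concat(frames, ignore_index=True)
df_out.to_csv('combined_json.csv')

Note that nested dicts such as regions are flattened into regions.id / regions.name columns, while list columns such as taggings would still need the record_path handling from the original code.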

2 Answers


If you only need to change ' to ":

...
for file in path:
    with open(file, "r") as filer:
        content = filer.read().replace("'", '"')
...

Making copies of the files and doing the replacement from the command line (with sed; grep only searches) would be easier.
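
For the result to parse as JSON, the Python-only literals also have to be mapped, and no string value may itself contain an apostrophe or the exact text ': None'. A sketch under those assumptions:

import json

for file in path:
    with open(file, "r") as filer:
        text = filer.read().replace("'", '"')    # quote style: ' -> "
        text = text.replace(": None", ": null")  # Python None -> JSON null
        data = json.loads(text)                  # now an actual dict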

Talpa

While it is not the solution I was initially expecting, this approach worked as well. I kept getting error messages related to the structure of the dict literals, which were reluctant to become JSON, so I took the CSV file that I wanted to normalise and worked through it column by column:

import json
import pandas as pd

df = pd.read_csv("combined_json.csv")
df['regions'] = df['regions'].apply(lambda x: x.replace("'", '"'))
regions = pd.json_normalize(df['regions'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('regions.', ''))
df['regions'] = regions['name']

Or, if it had more nested levels:

df['taggings'] = df['taggings'].apply(lambda x: x.replace("'", '"'))
taggings = pd.concat([pd.json_normalize(json.loads(j)) for j in df['taggings']])
df = df.reset_index(drop=True)
taggings = taggings.reset_index(drop=True)
df[['tags_id', 'nametag', 'group_tag', 'group_tag_name']] = taggings[
    ['tags.id', 'tags.name', 'tags.tag_groups.id', 'tags.tag_groups.name']]

The result was then written back out with df.to_csv().
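
A variant of the same per-column idea that sidesteps the quote replacement altogether, assuming the column still holds Python dict literals, is to parse them with ast.literal_eval:

import ast
import pandas as pd

df = pd.read_csv("combined_json.csv")
# literal_eval copes with None and with apostrophes inside values,
# which the "'" -> '"' replacement does not.
regions = pd.json_normalize(df['regions'].apply(ast.literal_eval).tolist())
df['regions'] = regions['name']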

Octner