1

I have a csv with 500+ rows where one column "_source" is stored as JSON. I want to extract that into a pandas dataframe. I need each key to be its own column.

I have a 1mb JSON file of online social media data that I need to convert the dictionary and key values into their own separate columns. The social media data is from Facebook,Twitter/web crawled... etc.

There are approximately 528 separate rows of posts/tweets/text with each having many dictionaries inside dictionaries.

I am attaching a few steps from my Jupyter notebook below to give a more complete understanding. I need to turn all key value pairs for dictionaries inside dictionaries into columns inside a dataframe.

I have tried changing it to a dataframe by doing this

source = pd.DataFrame.from_dict(source, orient='columns')

and it returns something like this... I thought it might unpack the dictionary but it did not.

source.head()

_source
0   {'sub_organization_id': 'default', 'uid': 'aba...
1   {'sub_organization_id': 'default', 'uid': 'ab0...
2   {'sub_organization_id': 'default', 'uid': 'ac0...

below is the shape

source.shape
(528, 1)

Following is a sample row of "_source". There are many dictionaries and key:value pairs where each key needs to be its own column.

{
    'sub_organization_id': 'default',
    'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
    'project_veid': 'default',
    'campaign_id': 'default',
    'organization_id': 'default',
    'meta': {
        'rule_matcher': [{
                'atribs': {
                    'website': 'github.com/res',
                    'source': 'Explicit',
                    'version': '1.1',
                    'type': 'crawl'
                },
                'results': [{
                        'rule_type': 'hashtag',
                        'rule_tag': 'Far',
                        'description': None,
                        'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
                        'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
                        'value': '#Far',
                        'organization_id': None,
                        'sub_organization_id': None,
                        'appid': 'ray',
                        'project_id': 'CDE2F42-5B87-C594-C900E578C',
                        'rule_id': '1838',
                        'node_id': None,
                        'metadata': {
                            'campaign_title': 'AF',
                            'project_title': 'AF '
                        }
                    }
                ]
            }
        ],
        'render': [{
                'attribs': {
                    'website': 'github.com/res',
                    'version': '1.0',
                    'type': 'Page Render'
                },
                'results': [{
                        'render_status': 'success',
                        'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
                        'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
                        'url': 'https://discooprdapp.com/',
                        'load_time': 32
                    }
                ]
            }
        ]
    },
    'norm_attribs': {
        'website': 'github.com/res',
        'version': '1.1',
        'type': 'crawl'
    },
    'project_id': 'default',
    'system_timestamp': '2019-02-22T19:04:53.569623',
    'doc': {
        'appid': 'subtter',
        'links': [],
        'response_url': 'https://discooprdapp.com',
        'url': 'https://discooprdapp.com/',
        'status_code': 200,
        'status_msg': 'OK',
        'encoding': 'utf-8',
        'attrs': {
            'uid': '2ab8f2651cb32261b911c990a8b'
        },
        'timestamp': '2019-02-22T19:04:53.963',
        'crawlid': '7fd95-785-4dd259-fcc-8752f'
    },
    'type': 'crawl',
    'norm': {
        'body': '\n',
        'domain': 'discordapp.com',
        'author': 'crawl',
        'url': 'https://discooprdapp.com',
        'timestamp': '2019-02-22T19:04:53.961283+00:00',
        'id': '7fc5-685-4dd9-cc-8762f'
    }
}
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158

2 Answers2

3

Go to the _source:

_source to list:

  • Given the sample data from the question
    • create a list of all the rows in _source

enter image description here

_source_list = df._source.tolist()

Use recursion to flatten the nested dicts

def flatten_json(nested_json: dict, exclude: list=['']) -> dict:
    """
    Flatten a list of nested dicts.
    """
    out = dict()
    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

Using flatten_json:

df_source = pd.DataFrame([flatten_json(x) for x in _source_list])
  • In this case, the final result will be a dataframe with 52 columns
sub_organization_id                                             uid project_veid campaign_id organization_id meta_rule_matcher_0_atribs_website meta_rule_matcher_0_atribs_source meta_rule_matcher_0_atribs_version meta_rule_matcher_0_atribs_type meta_rule_matcher_0_results_0_rule_type meta_rule_matcher_0_results_0_rule_tag meta_rule_matcher_0_results_0_description meta_rule_matcher_0_results_0_project_veid meta_rule_matcher_0_results_0_campaign_id meta_rule_matcher_0_results_0_value meta_rule_matcher_0_results_0_organization_id meta_rule_matcher_0_results_0_sub_organization_id meta_rule_matcher_0_results_0_appid meta_rule_matcher_0_results_0_project_id meta_rule_matcher_0_results_0_rule_id meta_rule_matcher_0_results_0_node_id meta_rule_matcher_0_results_0_metadata_campaign_title meta_rule_matcher_0_results_0_metadata_project_title meta_render_0_attribs_website meta_render_0_attribs_version meta_render_0_attribs_type meta_render_0_results_0_render_status                                                             meta_render_0_results_0_path meta_render_0_results_0_image_hash meta_render_0_results_0_url  meta_render_0_results_0_load_time norm_attribs_website norm_attribs_version norm_attribs_type project_id            system_timestamp doc_appid          doc_response_url                    doc_url  doc_status_code doc_status_msg doc_encoding                doc_attrs_uid            doc_timestamp                 doc_crawlid   type norm_body     norm_domain norm_author                  norm_url                    norm_timestamp                 norm_id
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
0
pd.io.json.json_normalize(source.columnName.apply(json.loads))
chessosapiens
  • 3,159
  • 10
  • 36
  • 58