I want to create a DataFrame by looping over a list whose items are dictionaries that in turn contain nested lists and dictionaries.

list = [{'sitemap': [{'path': 'http://test.com',
    'errors': '0',
    'contents': [{'type': 'web', 'submitted': '34801', 'indexed': '4656'}]}]},
 {'sitemap': [{'path': 'https://example.com',
    'errors': '0',
    'contents': [{'type': 'web', 'submitted': '2329'}]}]}]

Originally, this is what I tried:

import pandas as pd

data_for_df = []

for each in list:
    temp = []
    temp.append(each['sitemap'][0]['path'])
    temp.append(each['sitemap'][0]['errors'])
    temp.append(each['sitemap'][0]['contents'][0]['type'])
    temp.append(each['sitemap'][0]['contents'][0]['submitted'])
    temp.append(each['sitemap'][0]['contents'][0]['indexed'])
    data_for_df.append(temp)

df = pd.DataFrame(data_for_df, columns=['path', 'errors', 'type', 'submitted', 'indexed'])

However, I found that this code raises a KeyError because some key:value pairs are sometimes missing. In this example, the 'indexed' key is missing from the second entry. When this happens, I want to return an empty value or replace it with a null value. Could anyone help me with this?

  • `each['sitemap'][0]['contents'][0].get('indexed')` will return `None` if the key is missing. You can also specify another default as the optional second argument to `.get()` – Barmar Jul 28 '23 at 16:19
  • Good practice: avoid using the names of Python built-ins as variable names, for instance [`list`](https://docs.python.org/3/tutorial/datastructures.html). Pick another name. – OCa Jul 28 '23 at 16:40
  • Would https://stackoverflow.com/q/74471768/12846804 be an option? Also, relevant for getting dictionary items: https://stackoverflow.com/q/71460721/12846804 – OCa Aug 01 '23 at 08:44
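
To illustrate the `.get()` suggestion from the comments, here is a minimal sketch of the original loop with every lookup on the innermost dictionaries switched to `.get()`. The names `entries` and `rows` are only illustrative (chosen to avoid shadowing the built-in `list`), and the sketch assumes every entry still has a non-empty 'sitemap' list and a non-empty 'contents' list.

```python
import pandas as pd

# Same sample data as in the question; `entries` is an illustrative name
# used instead of `list` so the built-in is not shadowed.
entries = [
    {"sitemap": [{"path": "http://test.com", "errors": "0",
                  "contents": [{"type": "web", "submitted": "34801", "indexed": "4656"}]}]},
    {"sitemap": [{"path": "https://example.com", "errors": "0",
                  "contents": [{"type": "web", "submitted": "2329"}]}]},
]

rows = []
for each in entries:
    # Assumption: 'sitemap' and 'contents' are present and non-empty.
    sitemap = each["sitemap"][0]
    content = sitemap["contents"][0]
    rows.append([
        sitemap.get("path"),
        sitemap.get("errors"),
        content.get("type"),
        content.get("submitted"),
        content.get("indexed"),  # None when the key is missing
    ])

df = pd.DataFrame(rows, columns=["path", "errors", "type", "submitted", "indexed"])
print(df)
```

A different placeholder can be supplied as the second argument, for example `content.get("indexed", "")`, if an empty string is preferred over `None`.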

1 Answer

Maybe using pd.json_normalize will be enough in this case:

import pandas as pd

lst = [
    {
        "sitemap": [
            {
                "path": "http://test.com",
                "errors": "0",
                "contents": [{"type": "web", "submitted": "34801", "indexed": "4656"}],
            }
        ]
    },
    {
        "sitemap": [
            {
                "path": "https://example.com",
                "errors": "0",
                "contents": [{"type": "web", "submitted": "2329"}],
            }
        ]
    },
]

df = pd.json_normalize(lst, ['sitemap', ['contents']], [['sitemap', 'path'], ['sitemap', 'errors']])
print(df)

Prints:

  type submitted indexed         sitemap.path sitemap.errors
0  web     34801    4656      http://test.com              0
1  web      2329     NaN  https://example.com              0
Andrej Kesely
  • Thank you @andrej. I found that some entries are missing 'contents', which raises an error. In this case, can I still use json_normalize? – Anchobi_codes Jul 31 '23 at 15:57
  • In the end, I decided to first fill in the missing key:value pairs using a loop and if statements, and then apply json_normalize. Thank you for the help. – Anchobi_codes Aug 01 '23 at 09:49
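
The last comment does not show its code, so the following is only a sketch of what "fill in the missing pairs first, then json_normalize" could look like, not the asker's actual solution. The helper name `fill_missing_contents` and the third sample entry are hypothetical; the helper inserts a single empty dictionary wherever 'contents' is absent or empty, so that `pd.json_normalize` still produces one row for that sitemap.

```python
import copy

import pandas as pd

# Sample data: the two entries from the question plus a hypothetical third
# entry that has no 'contents' key at all.
lst = [
    {"sitemap": [{"path": "http://test.com", "errors": "0",
                  "contents": [{"type": "web", "submitted": "34801", "indexed": "4656"}]}]},
    {"sitemap": [{"path": "https://example.com", "errors": "0",
                  "contents": [{"type": "web", "submitted": "2329"}]}]},
    {"sitemap": [{"path": "https://example.org", "errors": "0"}]},
]


def fill_missing_contents(entries):
    """Return a copy of `entries` where every sitemap has a non-empty 'contents' list."""
    filled = copy.deepcopy(entries)  # leave the caller's data untouched
    for entry in filled:
        for sitemap in entry.get("sitemap", []):
            if not sitemap.get("contents"):
                sitemap["contents"] = [{}]  # one empty record -> one row of NaN values
    return filled


df = pd.json_normalize(
    fill_missing_contents(lst),
    record_path=["sitemap", "contents"],
    meta=[["sitemap", "path"], ["sitemap", "errors"]],
)
print(df)
```

With this approach the entry without 'contents' is kept as a row whose 'type', 'submitted' and 'indexed' cells are missing, instead of causing json_normalize to raise.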