I have a dataset on emails that looks something like this:
email_data = [{
"id": "random_id_1",
"to": [{"name": "A", "email": "a@example.com"}],
"cc": [{"name": "B", "email": "b@example.com"}, {"name": "C", "email": "c@example.com"}],
"bcc": [] },
{ "id": "random_id_2",
"to": [{"name": "D", "email": "d@example.com"}],
"cc": [{"name": "E", "email": "e@example.com"}],
"bcc": [] },
{ "id": "random_id_3",
"to": [{"name": "F", "email": "f@example.com"}],
"cc": [{"name": "G", "email": "g@example.com"}, {"name": "H", "email": "h@example.com"}],
"bcc": [] }]
I need to be able to search a large database like this on the email address in the "to" or "cc" field but I am not able to effectively parse them out.
So far I have managed to get this:
import pandas as pd
email_df = pd.DataFrame(email_data)
cc_df = pd.DataFrame(email_df['cc'].to_list())
pd.json_normalize(cc_df[0])
This does the job for the first email address of the "cc" field, but there can be a large number of email addresses in that field and I don't know how I would do this for all of them. (The dataset is dynamic, so I cannot hardcode a max number of columns)
Another drawback from this method is that as soon as I move to the second column on the "cc" field, it fails because one of the values is None. This happens with the following line of code:
pd.json_normalize(cc_df[1])