0

I have a dataset on emails that looks something like this:

email_data = [{ 
  "id": "random_id_1", 
  "to": [{"name": "A", "email": "a@example.com"}], 
  "cc": [{"name": "B", "email": "b@example.com"}, {"name": "C", "email": "c@example.com"}], 
  "bcc": [] },
{ "id": "random_id_2", 
  "to": [{"name": "D", "email": "d@example.com"}], 
  "cc": [{"name": "E", "email": "e@example.com"}], 
  "bcc": [] },    
{ "id": "random_id_3", 
  "to": [{"name": "F", "email": "f@example.com"}], 
  "cc": [{"name": "G", "email": "g@example.com"}, {"name": "H", "email": "h@example.com"}], 
  "bcc": [] }]

I need to be able to search a large database like this on the email address in the "to" or "cc" field but I am not able to effectively parse them out.

So far I have managed to get this:

import pandas as pd
email_df = pd.DataFrame(email_data)
cc_df = pd.DataFrame(email_df['cc'].to_list())
pd.json_normalize(cc_df[0])

This does the job for the first email address of the "cc" field, but there can be a large number of email addresses in that field and I don't know how I would do this for all of them. (The dataset is dynamic, so I cannot hardcode a max number of columns)

Another drawback from this method is that as soon as I move to the second column on the "cc" field, it fails because one of the values is None. This happens with the following line of code:

pd.json_normalize(cc_df[1])
Vincent
  • 1,137
  • 18
  • 40
  • pandas is not very useful in this situation. If your objective is to be able to search on `email` fields in `to` or `cc`, you should refer to this answer https://stackoverflow.com/a/41496646/5907936 and objectpath library – Yati Raj Jun 29 '20 at 16:08

1 Answers1

0

You wanna receive a list of all email addresses in to and cc, right?
The following should serve that need:

import pandas as pd

# Read data
email_df = pd.DataFrame(email_data)

# Split data in 'to'/'cc' into new columns
df_to = email_df["to"].explode().apply(pd.Series)
df_cc = email_df["cc"].explode().apply(pd.Series)

# Write columns 'email' to list
df_to["email"].to_list()
> ['a@example.com', 'd@example.com', 'f@example.com']

df_cc["email"].to_list()
> ['b@example.com', 'c@example.com', 'e@example.com', 'g@example.com', 'h@example.com']
zuecho
  • 118
  • 1
  • 10