0

I have about 300.000 rows as below, but what I need is only id and email address. Dataframe like this:

d = {'vid': [1201,1202], 'col2': [[{'vid': 1201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0,
 'identities': [{'type': 'EMAIL', 'value': 'abc@gmaill.com', 'timestamp': 1548608578090, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': '69c4f6ec-e0e9-4632-8d16-cbc204a57b22', 'timestamp': 1548608578106}]},
{'vid': 314479851, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 183374504, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17543251, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 99700201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 65375052, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17525601, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 238128701, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []}],

[{'vid': 1202, 'saved-at-timestamp': 1548608578109, 'deleted-changed-timestamp': 0, 
'identities': [{'type': 'EMAIL', 'value': 'xyz@gmaill.com', 'timestamp': 1548608578088, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': 'fe6c2628-b1db-47c5-91f6-258e79ea58f0', 'timestamp': 1548608578106}]}]]}

df=pd.DataFrame(d)
df

vid   col2
1201    [{'vid': 1201, 'saved-at-timestamp': 1638824550030........
1202    [{'vid': 1202, 'saved-at-timestamp': 1548608578109......

expected output (only two fields but for all rows):

vid   email
1201  abc@gmaill.com
1202  xyz@gmaill.com
..    ..

I tried to apply the solutions here but it didn't work

Bushmaster
  • 4,196
  • 3
  • 8
  • 28

2 Answers2

1

Here's one way using json_normalize:

out = (pd.concat(pd.json_normalize(lst, ['identities'], 'vid') for lst in d['col2'])
       .pipe(lambda x: x[x['type']=='EMAIL'])[['vid','value']]
       .rename(columns={'value':'email'}))

or just use str accessor repeatedly for only the "emails":

df=pd.DataFrame(d)
df['email'] = df['col2'].str[0].str.get('identities').str[0].str.get('value')
df = df.drop(columns='col2')

Output:

    vid           email
0  1201  abc@gmaill.com
0  1202  xyz@gmaill.com
  • i'm sorry but it works only first row. i need this for all rows. Is there something wrong with the question? I will edit if. – Bushmaster Mar 21 '22 at 19:51
  • you're right. There are about 300,000 rows in the same format as the first row, I'll update the question that way. – Bushmaster Mar 21 '22 at 19:55
  • it's working. but i have a new problem. before saving this data as csv this code was working. but now when i read csv as dataframe and run the code i get ` TypeError: string indices must be integers` . do you have an idea for this ? – Bushmaster Mar 22 '22 at 06:56
1

You can use pd.json_normalize:

df = pd.json_normalize([sub for item in d['col2'] for sub in item], record_path='identities', meta='vid')

Output:

>>> df
        type                                 value      timestamp is-primary   vid
0      EMAIL                        abc@gmaill.com  1548608578090       True  1201
1  LEAD_GUID  69c4f6ec-e0e9-4632-8d16-cbc204a57b22  1548608578106        NaN  1201
2      EMAIL                        xyz@gmaill.com  1548608578088       True  1202
3  LEAD_GUID  fe6c2628-b1db-47c5-91f6-258e79ea58f0  1548608578106        NaN  1202

And now just use .loc to get the data you want:

df = df.loc[df['type'] == 'EMAIL', ['vid', 'value']]

Output:

>>> df
    vid           value
0  1201  abc@gmaill.com
2  1202  xyz@gmaill.com

Or you can pivot the dataframe after using json_normalize, instead of using .loc:

df = df.pivot(index='vid', columns='type', values='value').rename_axis(None, axis=1).reset_index()

Output:

>>> df
    vid           EMAIL                             LEAD_GUID
0  1201  abc@gmaill.com  69c4f6ec-e0e9-4632-8d16-cbc204a57b22
1  1202  xyz@gmaill.com  fe6c2628-b1db-47c5-91f6-258e79ea58f0
  • 1
    this solution is for single row. I needed it for all rows. I think there is something missing in my question. I updated it. – Bushmaster Mar 21 '22 at 19:58
  • Updated @Clegane. Check now :) –  Mar 21 '22 at 20:26
  • this works faster than @enke 's solution.. Unfortunately i have a new problem. before saving this data as csv this code was working. but now when i read csv as dataframe and run the code i get ` TypeError: string indices must be integers` . how can i solve this ? – Bushmaster Mar 22 '22 at 06:58
  • What does your CSV file look like? Can you provide a sample of it in the question? –  Mar 22 '22 at 16:08
  • i've retrieved the data from its source again. now there is no problem. everything is working. – Bushmaster Mar 23 '22 at 08:57