Pandas splitting list with unorganised dictionaries

Question

I have a dataframe like this

id            Merchant ID  Date                 App Details
601179aa      1           27/01/21 20:03  
[{:appName "FAU-G", :packageName "com.ncoregames.faug"} 
 {:appName "Truecaller", :packageName "com.truecaller"}]

I want output like this

id           Merchant ID  Date                 App Name  Package Name
601179aa          1          27/01/21 20:03        FAU-G com.ncoreagames.faug
601179aa          1          27/01/21 20:03.   True Caller com.truecaller

I tried

df['App Details'] = df['App Details'].str.replace(r"\[","")
df['App Details'] = df['App Details'].str.replace(r"\]","")
foo = lambda x: pd.Series([i for i in (x.split(' '))])
app_df = df['App Details'].apply(foo)

the number of dictionaries in list keeps on varying.

please post your data as text. see this post for how to make good questions related to pandas: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Haleemur Ali, Jan 31 '21 at 13:18
Are you sure that the string is correct? there is no consistency. "FAU-G. is missing " i think — Billy Bonaros, Jan 31 '21 at 13:55
By "length of the dictionary keeps varying" do you mean the number of times a give ID is references or do you mean the number of keys in the dictionary? — Tom, Jan 31 '21 at 14:30
Length of list keeps varying I mean the number of dictionaries in a list keeps varying. — abhishek gupta, Jan 31 '21 at 14:32
I take it this is being imported from a json or csv or some such and that's why it is formatted in this manner? — Tom, Jan 31 '21 at 14:33
It is being imported from mongo database to csv and then reading in python — abhishek gupta, Jan 31 '21 at 14:35

score 2 · Accepted Answer · edited Jan 31 '21 at 15:24

2

You can do it using regular expressions:

df=pd.DataFrame({'id': {0: '601179aa'},
 'Merchant': {0: 1},
 'ID': {0: '27/01/21'},
 'Date': {0: '20:03'},
 'App Details': {0: '[{:appName"FAU-G".:packageName"com.ncoreagames.faug"}{:appName"Truecaller",:packageName"com.truecaller"}]'}})

import re

df['App Details']=df['App Details'].str.split("}{")

    
df=df.explode('App Details')

df['App Name']=df['App Details'].apply(lambda x:re.findall('appName"(.+?)"',x)).explode()
df['Package Name']=df['App Details'].apply(lambda x:re.findall('packageName"(.+?)"',x)).explode()


     App Name          Package Name
0       FAU-G  com.ncoreagames.faug
0  Truecaller        com.truecaller

edited Jan 31 '21 at 15:24

yudhiesh

6,383
3
16
49

answered Jan 31 '21 at 14:25

Billy Bonaros

1,671
11
18

this did not work for me I am getting NaN in App Name and Package Name – abhishek gupta Jan 31 '21 at 15:15
Then, something is wrong with your input, I copy and paste it – Billy Bonaros Jan 31 '21 at 15:18
@abhishekgupta try to run the whole code with the creation of the data frame and tell me if it works – Billy Bonaros Jan 31 '21 at 15:21
It is actually the spaces I forgot to give spaces. I have edited the input – abhishek gupta Jan 31 '21 at 15:48

score 1 · Answer 2 · answered Jan 31 '21 at 16:09

You can start by:

Expanding you cells into multiple rows:

df1 = pd.concat([pd.Series(row['id'], row['App'].split('}{'))        
                  for _, row in df.iterrows()]).reset_index().rename(columns = {'index':'temp1', 0:'id'})
df1['temp1'] = df1['temp1'].str.replace(r'\[{|}\]', '')
df1

To get this (Notice the change in column names):

temp1   id
0   :appName"FAU-G",:packageName"com.ncoreagames.f...   601179aa
1   :appName"Truecaller",:packageName"com.truecaller"   601179aa

Split the new rows string into multiple columns: Here The code is also getting the names for the columns and renaming them appropriately (assumption is that all your rows start with :appName or :packageName - Other wise adjust the regex inside the findall step)

df2 = df1["temp1"].str.split(",", n= 1 ,expand = True) 
names = []
for col in df2.columns:
    name = re.findall('(?<=\:)(.*?)(?=\")',df2[col][0])[0]
    df2.loc[:,col] = df2[col].str.replace(name, '')
    df2.loc[:,col] = df2[col].str.replace('\"|\:', '')
    names.append(name)
df2.columns = names
df2

You get this:

    appName packageName
0   FAU-G   com.ncoreagames.faug
1   Truecaller  com.truecaller

Putting it all together using the new column names:

finaldf = df[['id', 'Merchant','ID',    'Date', 'Details']].merge(df1[['id']].merge(df2, left_index = True, right_index = True, how = 'inner'), on = 'id')
finaldf


id  Merchant    ID  Date    Details appName packageName
0   601179aa    1   27/01/21    20:03   NaN FAU-G   com.ncoreagames.faug
1   601179aa    1   27/01/21    20:03   NaN Truecaller  com.truecaller

score 1 · Answer 3 · answered Jan 31 '21 at 17:43

Your string looks very similar to json. One approach might be to convert the string to valid json & read into a dict. then explode, and use json_normalize to extract the information, finally setting the index to the correct value.

The one benefit of this approach when compared to the accepted answer is that it is more general. The way this approach might fail is if your structure is not easily convertible to json.

In the example, I've assumed that the keys are all words and there is no , between records, but there may be several spaces (actually non-alphabetical characters).

import json
import pandas as pd

df = pd.DataFrame({
        'id': ['601179aa'],
        'Merchant ID': [1],
        'Date': ['27/01/21 20:03'],
        'App Details': [
            '[{:appName "FAU-G", :packageName "com.ncoreagames.faug"}{:appName "Truecaller", :packageName "com.truecaller"}]'
            ]
})

idx_cols = ['id', 'Merchant ID', 'Date']

df2 = df.set_index(idx_cols)['App Details'] \
  .str.replace(':(\w+)', '"\\1":') \
  .str.replace('}\W*{', '},{') \
  .apply(json.loads).explode()

df3 = pd.json_normalize(df2).set_index(df2.index).reset_index()

df3 outputs:

         id  Merchant ID            Date     appName           packageName
0  601179aa            1  27/01/21 20:03       FAU-G  com.ncoreagames.faug
1  601179aa            1  27/01/21 20:03  Truecaller        com.truecaller

Pandas splitting list with unorganised dictionaries

3 Answers3