2

I have below data frame with duplicate data. I want to remove these duplicates from dataframe.

df = pd.DataFrame({'test_id': [
        {'user_id':2, 'insert_date':'2020-12-23', 'is_admin': "true"},
        {'user_id':4, 'insert_date':'2020-12-23', 'is_admin': "true"},
        {'user_id':3, 'insert_date':'2020-12-21', 'is_admin': "false"},
        {'user_id':2, 'insert_date':'2020-12-23', 'is_admin': "true"}
    ], 'contact_id':[1,4,2,1]}
)

print(df)

                                             test_id  contact_id
0  {'user_id': 2, 'insert_date': '2020-12-23', 'i...           1
1  {'user_id': 4, 'insert_date': '2020-12-23', 'i...           4
2  {'user_id': 3, 'insert_date': '2020-12-21', 'i...           2
3  {'user_id': 2, 'insert_date': '2020-12-23', 'i...           1

I have tried below to remove dupicates

df = df.drop_duplicates(subset=['test_id', 'contact_id'], keep='first')
print(df)

But getting below error

TypeError: unhashable type: 'dict'

Can anyone guide me how can I delete duplicate data based on 'test_id', 'contact_id' combination ?

I want below output

                                             test_id  contact_id
0  {'user_id': 2, 'insert_date': '2020-12-23', 'i...           1
1  {'user_id': 4, 'insert_date': '2020-12-23', 'i...           4
2  {'user_id': 3, 'insert_date': '2020-12-21', 'i...           2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Sangram Badi
  • 4,054
  • 9
  • 45
  • 78

2 Answers2

2

You can convert the values in column test_id to string first as follows before removing the duplicates:

df['test_id'] = df['test_id'].astype(str)

Then run your code:

df = df.drop_duplicates(subset=['test_id', 'contact_id'], keep='first')
print(df)

Output is same as what you desire:

                                             test_id  contact_id
0  {'user_id': 2, 'insert_date': '2020-12-23', 'i...           1
1  {'user_id': 4, 'insert_date': '2020-12-23', 'i...           4
2  {'user_id': 3, 'insert_date': '2020-12-21', 'i...           2

Edit

You can restore the data type of column test_id from str back to dict by the following code:

df['test_id'] = df['test_id'].map(eval)

Alternatively, you can also consider the following code, which is generally considered a safer option than using eval (if the data is from external source) :

import ast

df["test_id"] = df["test_id"].map(ast.literal_eval) 

See the post Using python's eval() vs. ast.literal_eval()? for more information.

After either piece of codes, you can perform your dict/json operations as usual like:

type(df.loc[2, 'test_id'])            gets: dict
df.loc[2, 'test_id']['insert_date']   gets: '2020-12-21'
df.loc[2, 'test_id']['user_id']       gets: 3

df['test_id'].to_list()               gets the list of dict instead of list of str:
    [{'user_id': 2, 'insert_date': '2020-12-23', 'is_admin': 'true'},
     {'user_id': 4, 'insert_date': '2020-12-23', 'is_admin': 'true'},
     {'user_id': 3, 'insert_date': '2020-12-21', 'is_admin': 'false'},
     {'user_id': 2, 'insert_date': '2020-12-23', 'is_admin': 'true'}]
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • You answer is correct, but converting to str will not use for other functionality. so @Jezrael answer is for accurate. so accepting his answer but I have upvoted your answer. – Sangram Badi Mar 05 '21 at 06:47
  • After converting to str, how can I convert it to JSON again for column test_id – Sangram Badi Mar 06 '21 at 09:52
  • 1
    @SangramBadi, try: `import ast` `df["test_id"] = df["test_id"].map(ast.literal_eval)` – SeaBean Mar 06 '21 at 13:42
  • 1
    @SangramBadi A simpler way is just using: `df['test_id'] = df['test_id'].map(eval)`. See my edit above. – SeaBean Mar 06 '21 at 18:38
1

You can use Series.duplicated and filter in boolean indexing by original values, so original dictionaries are not changed:

You can convert dictianaries to DataFrame by json_normalize:

print(pd.json_normalize(df['test_id']))
   user_id insert_date is_admin
0        2  2020-12-23     true
1        4  2020-12-23     true
2        3  2020-12-21    false
3        2  2020-12-23     true

df = df[~pd.json_normalize(df['test_id']).assign(new = df['contact_id']).duplicated()]
print (df)
                                             test_id  contact_id
0  {'user_id': 2, 'insert_date': '2020-12-23', 'i...           1
1  {'user_id': 4, 'insert_date': '2020-12-23', 'i...           4
2  {'user_id': 3, 'insert_date': '2020-12-21', 'i...           2

If order is always same working if convert to strings:

df = df[~df[['test_id']].astype(str).assign(new = df['contact_id'].duplicated()]
print (df)
                                             test_id  contact_id
0  {'user_id': 2, 'insert_date': '2020-12-23', 'i...           1
1  {'user_id': 4, 'insert_date': '2020-12-23', 'i...           4
2  {'user_id': 3, 'insert_date': '2020-12-21', 'i...           2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252