
I have lots of data in Excel files. I would like to concatenate this data into one Excel file, removing duplicate records according to the id column.

df1 
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998


df2 
   id    name   date 
0  11    den    2012
1  14    ces    2011 
2   4    guk    2007

I want the concatenated file below as the final result.

Concat df
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998
1  14    ces    2011 
2   4    guk    2007

I tried the code below, but it does not remove duplicates. Can anyone advise how to fix this?

pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

My concatenated data is shown below. Duplicated ids are still in the file.

    id                  created_at          retweet_count
0   721557296757797000  2016-04-17 04:34:00 21
1   721497712726844000  2016-04-17 00:37:14 94
2   721462059515453000  2016-04-16 22:15:33 0
3   721460623285072000  2016-04-16 22:09:51 0
4   721460397241446000  2016-04-16 22:08:57 0
5   721459817651577000  2016-04-16 22:06:39 0
6   721456334894469000  2016-04-16 21:52:48 0
7   721557296757797000  2016-04-17 04:34:00 21
8   721497712726844000  2016-04-17 00:37:14 94
9   721462059515453000  2016-04-16 22:15:33 0
10  721460623285072000  2016-04-16 22:09:51 0
11  721460397241446000  2016-04-16 22:08:57 0
12  721459817651577000  2016-04-16 22:06:39 0
13  721456334894469000  2016-04-16 21:52:48 0
Behzat
  • Possible duplicate of [Drop all duplicate rows in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas) – Herpes Free Engineer Jul 04 '18 at 17:06

2 Answers

I think you need to add the parameter subset to drop_duplicates to filter by the column id:

print(pd.concat([df1,df2]).drop_duplicates(subset='id').reset_index(drop=True))
   id name  date
0   1  cab  2017
1  11  den  2012
2  13  ers  1998
3  14  ces  2011
4   4  guk  2007
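For reference, drop_duplicates keeps the first occurrence of each id by default; the keep parameter controls which copy survives. A small sketch using the sample frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 11, 13],
                    'name': ['cab', 'den', 'ers'],
                    'date': [2017, 2012, 1998]})
df2 = pd.DataFrame({'id': [11, 14, 4],
                    'name': ['den', 'ces', 'guk'],
                    'date': [2012, 2011, 2007]})

# keep='first' (the default) retains the row from df1 for id 11;
# keep='last' would retain df2's copy, keep=False would drop both.
out = pd.concat([df1, df2]).drop_duplicates(subset='id', keep='first').reset_index(drop=True)
```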

EDIT:

I tried your new data and it works for me:

import pandas as pd

df = pd.DataFrame({'created_at': {0: '2016-04-17 04:34:00', 1: '2016-04-17 00:37:14', 2: '2016-04-16 22:15:33', 3: '2016-04-16 22:09:51', 4: '2016-04-16 22:08:57', 5: '2016-04-16 22:06:39', 6: '2016-04-16 21:52:48', 7: '2016-04-17 04:34:00', 8: '2016-04-17 00:37:14', 9: '2016-04-16 22:15:33', 10: '2016-04-16 22:09:51', 11: '2016-04-16 22:08:57', 12: '2016-04-16 22:06:39', 13: '2016-04-16 21:52:48'}, 'retweet_count': {0: 21, 1: 94, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 21, 8: 94, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0}, 'id': {0: 721557296757797000, 1: 721497712726844000, 2: 721462059515453000, 3: 721460623285072000, 4: 721460397241446000, 5: 721459817651577000, 6: 721456334894469000, 7: 721557296757797000, 8: 721497712726844000, 9: 721462059515453000, 10: 721460623285072000, 11: 721460397241446000, 12: 721459817651577000, 13: 721456334894469000}},
                  columns=['id','created_at','retweet_count'])
print(df)
                    id           created_at  retweet_count
0   721557296757797000  2016-04-17 04:34:00             21
1   721497712726844000  2016-04-17 00:37:14             94
2   721462059515453000  2016-04-16 22:15:33              0
3   721460623285072000  2016-04-16 22:09:51              0
4   721460397241446000  2016-04-16 22:08:57              0
5   721459817651577000  2016-04-16 22:06:39              0
6   721456334894469000  2016-04-16 21:52:48              0
7   721557296757797000  2016-04-17 04:34:00             21
8   721497712726844000  2016-04-17 00:37:14             94
9   721462059515453000  2016-04-16 22:15:33              0
10  721460623285072000  2016-04-16 22:09:51              0
11  721460397241446000  2016-04-16 22:08:57              0
12  721459817651577000  2016-04-16 22:06:39              0
13  721456334894469000  2016-04-16 21:52:48              0

print(df.dtypes)

id                int64
created_at       object
retweet_count     int64
dtype: object


print(df.drop_duplicates(subset='id').reset_index(drop=True))
                   id           created_at  retweet_count
0  721557296757797000  2016-04-17 04:34:00             21
1  721497712726844000  2016-04-17 00:37:14             94
2  721462059515453000  2016-04-16 22:15:33              0
3  721460623285072000  2016-04-16 22:09:51              0
4  721460397241446000  2016-04-16 22:08:57              0
5  721459817651577000  2016-04-16 22:06:39              0
6  721456334894469000  2016-04-16 21:52:48              0
jezrael

Another way:

pd.concat([df1, df2]).groupby('id').first()

(DataFrame.append was deprecated and removed in pandas 2.0; pd.concat is the replacement.)
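Note that grouping by id moves id into the index and sorts the groups, unlike drop_duplicates, so a reset_index is needed to get id back as a column. A minimal sketch using the question's sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 11, 13],
                    'name': ['cab', 'den', 'ers'],
                    'date': [2017, 2012, 1998]})
df2 = pd.DataFrame({'id': [11, 14, 4],
                    'name': ['den', 'ces', 'guk'],
                    'date': [2012, 2011, 2007]})

# groupby('id') makes id the index and sorts the keys by default;
# .first() keeps the first row seen for each id, then reset_index
# restores id as an ordinary column.
out = pd.concat([df1, df2]).groupby('id').first().reset_index()
```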
B. M.