
I have lots of data in Excel files. I would like to concatenate this data into one Excel file, removing duplicate records according to the id column.

df1 
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998


df2 
   id    name   date 
0  11    den    2012
1  14    ces    2011 
2   4    guk    2007

I want the concatenated file below as the final result.

Concat df
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998
1  14    ces    2011 
2   4    guk    2007

I tried the code below, but it does not remove duplicates. Can anyone advise how to fix this?

pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

My concatenated data is shown below. Duplicated ids are still in the file.

    id                  created_at          retweet_count
0   721557296757797000  2016-04-17 04:34:00 21
1   721497712726844000  2016-04-17 00:37:14 94
2   721462059515453000  2016-04-16 22:15:33 0
3   721460623285072000  2016-04-16 22:09:51 0
4   721460397241446000  2016-04-16 22:08:57 0
5   721459817651577000  2016-04-16 22:06:39 0
6   721456334894469000  2016-04-16 21:52:48 0
7   721557296757797000  2016-04-17 04:34:00 21
8   721497712726844000  2016-04-17 00:37:14 94
9   721462059515453000  2016-04-16 22:15:33 0
10  721460623285072000  2016-04-16 22:09:51 0
11  721460397241446000  2016-04-16 22:08:57 0
12  721459817651577000  2016-04-16 22:06:39 0
13  721456334894469000  2016-04-16 21:52:48 0
Behzat
  • Possible duplicate of [Drop all duplicate rows in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas) – Herpes Free Engineer Jul 04 '18 at 17:06

2 Answers

I think you need to add the parameter subset to drop_duplicates to filter by the column id:

print(pd.concat([df1,df2]).drop_duplicates(subset='id').reset_index(drop=True))
   id name  date
0   1  cab  2017
1  11  den  2012
2  13  ers  1998
3  14  ces  2011
4   4  guk  2007
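For reference, drop_duplicates keeps the first occurrence of each id by default; the keep parameter controls which copy survives. A small sketch using the sample frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 11, 13],
                    'name': ['cab', 'den', 'ers'],
                    'date': [2017, 2012, 1998]})
df2 = pd.DataFrame({'id': [11, 14, 4],
                    'name': ['den', 'ces', 'guk'],
                    'date': [2012, 2011, 2007]})

# keep='first' (the default) retains the row from df1 for id 11;
# keep='last' would retain df2's copy, keep=False would drop both.
out = pd.concat([df1, df2]).drop_duplicates(subset='id', keep='first').reset_index(drop=True)
```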

EDIT:

I tried your new data and it works for me:

import pandas as pd

df = pd.DataFrame({'created_at': {0: '2016-04-17 04:34:00', 1: '2016-04-17 00:37:14', 2: '2016-04-16 22:15:33', 3: '2016-04-16 22:09:51', 4: '2016-04-16 22:08:57', 5: '2016-04-16 22:06:39', 6: '2016-04-16 21:52:48', 7: '2016-04-17 04:34:00', 8: '2016-04-17 00:37:14', 9: '2016-04-16 22:15:33', 10: '2016-04-16 22:09:51', 11: '2016-04-16 22:08:57', 12: '2016-04-16 22:06:39', 13: '2016-04-16 21:52:48'}, 'retweet_count': {0: 21, 1: 94, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 21, 8: 94, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0}, 'id': {0: 721557296757797000, 1: 721497712726844000, 2: 721462059515453000, 3: 721460623285072000, 4: 721460397241446000, 5: 721459817651577000, 6: 721456334894469000, 7: 721557296757797000, 8: 721497712726844000, 9: 721462059515453000, 10: 721460623285072000, 11: 721460397241446000, 12: 721459817651577000, 13: 721456334894469000}},
                  columns=['id','created_at','retweet_count'])
print(df)
                    id           created_at  retweet_count
0   721557296757797000  2016-04-17 04:34:00             21
1   721497712726844000  2016-04-17 00:37:14             94
2   721462059515453000  2016-04-16 22:15:33              0
3   721460623285072000  2016-04-16 22:09:51              0
4   721460397241446000  2016-04-16 22:08:57              0
5   721459817651577000  2016-04-16 22:06:39              0
6   721456334894469000  2016-04-16 21:52:48              0
7   721557296757797000  2016-04-17 04:34:00             21
8   721497712726844000  2016-04-17 00:37:14             94
9   721462059515453000  2016-04-16 22:15:33              0
10  721460623285072000  2016-04-16 22:09:51              0
11  721460397241446000  2016-04-16 22:08:57              0
12  721459817651577000  2016-04-16 22:06:39              0
13  721456334894469000  2016-04-16 21:52:48              0

print(df.dtypes)

id                int64
created_at       object
retweet_count     int64
dtype: object


print(df.drop_duplicates(subset='id').reset_index(drop=True))
                   id           created_at  retweet_count
0  721557296757797000  2016-04-17 04:34:00             21
1  721497712726844000  2016-04-17 00:37:14             94
2  721462059515453000  2016-04-16 22:15:33              0
3  721460623285072000  2016-04-16 22:09:51              0
4  721460397241446000  2016-04-16 22:08:57              0
5  721459817651577000  2016-04-16 22:06:39              0
6  721456334894469000  2016-04-16 21:52:48              0
jezrael

Another way:

pd.concat([df1, df2]).groupby('id').first()

(DataFrame.append was deprecated and removed in pandas 2.0; pd.concat is the replacement.)
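Note that grouping by id moves id into the index and sorts the groups, unlike drop_duplicates, so a reset_index is needed to get id back as a column. A minimal sketch using the question's sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 11, 13],
                    'name': ['cab', 'den', 'ers'],
                    'date': [2017, 2012, 1998]})
df2 = pd.DataFrame({'id': [11, 14, 4],
                    'name': ['den', 'ces', 'guk'],
                    'date': [2012, 2011, 2007]})

# groupby('id') makes id the index and sorts the keys by default;
# .first() keeps the first row seen for each id, then reset_index
# restores id as an ordinary column.
out = pd.concat([df1, df2]).groupby('id').first().reset_index()
```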
B. M.