
I'm trying to write a Python script that concatenates two CSV files and then drops the duplicate rows. Here is an example of the CSVs I'm concatenating:

csv_1

type    state    city    date        estimate    id
lux     tx       dal     2019/08/15  .8273452    10
sed     ny       ny      2019/05/12  .624356     10
cou     cal      la      2013/04/24  .723495     10
.       .        .       .           .           .
.       .        .       .           .           .

csv_2

type    state    city    date        estimate    id
sed     col      den     2013/05/02  .7234957    232
sed     mi       det     2015/11/17  .4249357    232
lux     nj       al      2009/02/29  .627234     232
.       .        .       .           .           .
.       .        .       .           .           .

As of now, my code to concat these two together looks like this:

csv_1 = pd.read_csv('csv_1.csv')
csv_2 = pd.read_csv('csv_2.csv')
union_df = pd.concat([csv_1, csv_2])
union_df.drop_duplicates(subset=['type', 'state', 'city', 'date'], inplace=True, keep='first')

Is there any way I can ensure only rows with id = 232 are deleted and none with id = 10 are? Just a way to specify only rows from the second csv are removed from the concatenated csv?

Thank you

JMV12

2 Answers


Use duplicated and boolean logic:

union_df.loc[~(union_df.duplicated(subset=['type','state','city','date'], keep='first') & (union_df['id'] == 232))]
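A minimal, self-contained sketch of this approach. The two DataFrames below are hypothetical stand-ins for the question's CSV files, with one row shared between them; with keep='first', duplicated flags the csv_2 copy, and the id check ensures only id = 232 rows can be dropped:

```python
import pandas as pd

# Hypothetical stand-ins for csv_1.csv and csv_2.csv
csv_1 = pd.DataFrame({'type': ['lux', 'sed'], 'state': ['tx', 'ny'],
                      'city': ['dal', 'ny'], 'date': ['2019/08/15', '2019/05/12'],
                      'estimate': [.8273452, .624356], 'id': [10, 10]})
csv_2 = pd.DataFrame({'type': ['sed', 'lux'], 'state': ['ny', 'nj'],
                      'city': ['ny', 'al'], 'date': ['2019/05/12', '2009/02/28'],
                      'estimate': [.7234957, .627234], 'id': [232, 232]})

union_df = pd.concat([csv_1, csv_2])

# A row is removed only if it is a duplicate AND comes from csv_2 (id == 232)
mask = union_df.duplicated(subset=['type', 'state', 'city', 'date'],
                           keep='first') & (union_df['id'] == 232)
result = union_df.loc[~mask]
```

Here the shared (sed, ny, ny, 2019/05/12) row survives only as its csv_1 copy; every id = 10 row is guaranteed to remain because the mask can never be True for them.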
Scott Boston

Instead of dropping the duplicates directly with the drop_duplicates method, I would recommend using the duplicated method. It works the same way, but instead of dropping rows it returns a boolean Series indicating which rows are duplicates. Once you have that, you can combine it with the id column to achieve your purpose. Take a look below.

csv_1 = pd.read_csv('csv_1.csv')
csv_2 = pd.read_csv('csv_2.csv')
union_df = pd.concat([csv_1, csv_2])
union_df["dups"] = union_df.duplicated(subset=['type', 'state', 'city', 'date'],
                                       keep='first')

union_df = union_df.loc[lambda d: ~((d.dups) & (d.id==232))]
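A self-contained sketch of this variant, again using hypothetical toy frames in place of the two CSVs (note that duplicated does not accept an inplace argument, and the helper "dups" column can be dropped afterwards):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSVs from the question
csv_1 = pd.DataFrame({'type': ['lux', 'sed'], 'state': ['tx', 'ny'],
                      'city': ['dal', 'ny'], 'date': ['2019/08/15', '2019/05/12'],
                      'estimate': [.8273452, .624356], 'id': [10, 10]})
csv_2 = pd.DataFrame({'type': ['sed', 'sed'], 'state': ['ny', 'col'],
                      'city': ['ny', 'den'], 'date': ['2019/05/12', '2013/05/02'],
                      'estimate': [.7234957, .7234957], 'id': [232, 232]})

union_df = pd.concat([csv_1, csv_2])

# Flag duplicates; with keep='first' the csv_1 copy stays unflagged
union_df["dups"] = union_df.duplicated(subset=['type', 'state', 'city', 'date'],
                                       keep='first')

# Drop only rows that are both flagged and from csv_2, then discard the helper column
union_df = union_df.loc[lambda d: ~(d.dups & (d.id == 232))].drop(columns="dups")
```

As in the first answer, no id = 10 row can ever match the mask, so nothing from csv_1 is removed.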
ivallesp