Concatenate two dataframes and drop duplicates in Pandas

Question

For df2 which only has data in the year of 2019:

  type  year  value
0    a  2019     13
1    b  2019      5
2    c  2019      5
3    d  2019     20

df1 has multiple years data:

  type  year  value
0    a  2015     12
1    a  2016      2
2    a  2019      3
3    b  2018     50
4    b  2019     10
5    c  2017      1
6    c  2016      5
7    c  2019      8

I need to concatenate them together while replacing df2's values in 2019 with the values from df1's same year.

The expected result will like this:

  type  date  value
0    a  2015     12
1    a  2016      2
2    b  2018     50
3    c  2017      1
4    c  2016      5
5    a  2019     13
6    b  2019      5
7    c  2019      5
8    d  2019     20

The result from pd.concat([df1, df2], ignore_index=True, sort =False), which clearly have multiple values in year of 2019 for one type. How should I improve the code? Thank you.

   type  date  value
0     a  2019     13
1     b  2019      5
2     c  2019      5
3     d  2019     20
4     a  2015     12
5     a  2016      2
6     a  2019      3
7     b  2018     50
8     b  2019     10
9     c  2017      1
10    c  2016      5
11    c  2019      8

score 4 · Accepted Answer · answered Aug 30 '19 at 08:01

4

Add DataFrame.drop_duplicates for get last rows per type and date after concat.

Solution working if type and date pairs are unique in both DataFrames.

df = (pd.concat([df1, df2], ignore_index=True, sort =False)
        .drop_duplicates(['type','date'], keep='last'))

answered Aug 30 '19 at 08:01

jezrael

822,522
95
1,334
1,252

Does `keep='last'` mean that `value`s from `df1` will be taken instead of `df2`? What if I write `df = (pd.concat([df2, df1], ignore_index=True, sort =False).drop_duplicates(['type','date'], keep='last'))` – ah bon Aug 30 '19 at 08:05
1

@ahbon - It means it get last value of concanecated rows, so here from `df2` if exist, if not exist from df1. – jezrael Aug 30 '19 at 08:06
point - after pd.concat([df1, df2], ignore_index=True, sort =False) the values from df1 are the top 4 rows so keep='last' keeps the original values and doesn't updates with the new values.. – iamklaus Aug 30 '19 at 08:12
@iamklaus - yes, just see it. If test OP output from `pd.concat([df1, df2], ignore_index=True, sort =False)` it means `df1 is df2` and `df2 is df1`, just edit question. – jezrael Aug 30 '19 at 08:13
1

`I need to concatenate them together while replacing df2's values in 2019 with the values from df1's same year.`.. best would be to change the names :) +1 – iamklaus Aug 30 '19 at 08:16
Sorry, if `df1` and `df2` don't have same columns, for example, if `df1` has more columns `a, b, c`, it seems doesn't work out. I have tried with `df = (pd.concat([df1, df2], ignore_index=True, sort =False) .drop_duplicates(['type','date', 'a', `b`, `c`], keep='last'))`, duplicates rows remain. Should I change something else? – ah bon Aug 30 '19 at 08:47
1

Can you change `df = (pd.concat([df1, df2], ignore_index=True, sort =False) .drop_duplicates(['type','date', 'a', b, c], keep='last'))` to `df = (pd.concat([df1, df2[df2.columns.intersection(df1.columns)], ignore_index=True, sort =False) .drop_duplicates(['type','date', 'a', b, c], keep='last'))` - [link](https://stackoverflow.com/a/48539234/2901002) – jezrael Aug 30 '19 at 08:51

Concatenate two dataframes and drop duplicates in Pandas

1 Answers1

Linked

Related