Pandas De-Duplicating without losing deleted data

Asked Jul 08 '22 at 04:02

Active Jul 08 '22 at 04:06

I have concatenated multiple data sources, and each source has its own unique columns. I want to merge rows that match on a list of columns and keep the data from every source in the same row.

Example:

import pandas as pd

df1 = pd.DataFrame({'SharedData': ['A', 'B', 'C', 'D', 'E'],
                    'df1Data': ['1', '2', '3', '4', '5']})

df2 = pd.DataFrame({'SharedData': ['D', 'E', 'F', 'G', 'H'],
                    'df2Data': ['4', '5', '6', '7', '8']})

newdf = pd.concat([df1, df2], axis=0, ignore_index=True)

I need the data set to go from the "before" table to the "after" table below.

Before Data Set:

SharedData	df1Data	df2Data
A	1
B	2
C	3
D	4
E	5
D		4
E		5
F		6
G		7
H		8

After Data Set:

SharedData	df1Data	df2Data
A	1
B	2
C	3
D	4	4
E	5	5
F		6
G		7
H		8

I need to replace rows whose SharedData matches with a single row that contains all of the df-specific data.

edited Jul 08 '22 at 04:06

Quang Hoang

asked Jul 08 '22 at 04:02

Colt Mercer

`df1.merge(df2, on='SharedData', how='outer')`, then fillna on `df1Data` column with `df2Data` column. – Quang Hoang Jul 08 '22 at 04:06
Can 'on' be a list? My real data set has to match on 3 different columns. – Colt Mercer Jul 08 '22 at 04:10
Yes, it can be a list `on=['col1','col2','col3']`. – Quang Hoang Jul 08 '22 at 04:10
Last question: when I created each df I added a column named 'source' with the value df1 or df2. Is there a way to replace the source column with a list of both, ['df1', 'df2']? – Colt Mercer Jul 08 '22 at 04:20
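
The suggestions in the comments can be put together as a rough sketch. It assumes the 'source' column is added to each frame before merging, as described in the last comment. The `keys` list, the `_df1`/`_df2` suffixes, and the list-building step are illustrative choices, not something confirmed in the thread; the real data would list its three matching columns in `keys`.

```python
import pandas as pd

df1 = pd.DataFrame({'SharedData': ['A', 'B', 'C', 'D', 'E'],
                    'df1Data': ['1', '2', '3', '4', '5']})
df2 = pd.DataFrame({'SharedData': ['D', 'E', 'F', 'G', 'H'],
                    'df2Data': ['4', '5', '6', '7', '8']})

# Tag each frame with its origin (assumed to mirror the asker's 'source' column).
df1['source'] = 'df1'
df2['source'] = 'df2'

# Columns to match on; a list works, so the real data can name all three here.
keys = ['SharedData']

# An outer merge keeps every key from both frames. The clashing 'source'
# columns receive these suffixes instead of the default _x/_y.
merged = df1.merge(df2, on=keys, how='outer', suffixes=('_df1', '_df2'))

# Build one list of sources per row, skipping the side that had no match (NaN).
merged['source'] = [
    [s for s in pair if pd.notna(s)]
    for pair in zip(merged['source_df1'], merged['source_df2'])
]
merged = merged.drop(columns=['source_df1', 'source_df2'])

print(merged)
```

Rows found in only one frame keep NaN in the other frame's data column, and rows found in both (D and E here) carry both values plus a two-element source list. Merging the original frames directly avoids the concat step, so there are no duplicate rows to clean up afterwards.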
