I have concat multiple data sources and each data source has it's own unique columns based on the data source. I want to merge rows who match on a list of columns and keep all unique data sources in the same row.
Example:
df1 = pd.DataFrame({'SharedData': ['A', 'B', 'C', 'D', 'E'],
'df1Data': ['1', '2', '3', '4', '5']})
df2 = pd.DataFrame({'SharedData': ['D', 'E', 'F', 'G', 'H'],
'df2Data': ['4', '5', '6', '7', '8']})
newdf = pd.concat([df1,df2], axis=0, ignore_index=True)
I need the resulting data set to go from before and after below.
Before Data Set:
SharedData | df1Data | df2Data |
---|---|---|
A | 1 | |
B | 2 | |
C | 3 | |
D | 4 | |
E | 5 | |
D | 4 | |
E | 5 | |
F | 6 | |
G | 7 | |
H | 8 |
After Data Set:
SharedData | df1Data | df2Data |
---|---|---|
A | 1 | |
B | 2 | |
C | 3 | |
D | 4 | 4 |
E | 5 | 5 |
F | 6 | |
G | 7 | |
H | 8 |
I need to deduplicate rows where SharedData matches, with a new row that contains all of the df specific data.