I am trying to merge two or more CSV files with the same schema. The files will contain duplicate entries, but the duplicated rows won't be identical. For example:
file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111
file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282
Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282
Notice that
9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999
are duplicates based on store_id and phone, but I picked the row from file2 since its address was more descriptive.
store_id+phone
is my composite primary key for looking up a location and finding duplicates (store_id alone would be enough in the example above, but in general I need a key built from multiple column values).
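For concreteness, here is a rough sketch of the rule I have in mind (the file names are placeholders, and I'm assuming "more descriptive" can be approximated by "longer string"):

```python
import pandas as pd

# Read every input file into one frame (file names are placeholders).
frames = [pd.read_csv(f) for f in ["file1.csv", "file2.csv"]]
df = pd.concat(frames, ignore_index=True)

key = ["store_id", "phone"]  # composite primary key

# Assumption: "more descriptive" ~= "longer string". Sort so the
# longest address in each duplicate group comes last, then keep it.
df["_addr_len"] = df["address"].str.len()
merged = (
    df.sort_values("_addr_len")
      .drop_duplicates(key, keep="last")
      .drop(columns="_addr_len")
)
print(merged.to_csv(index=False))
```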
Question:
- I need to merge multiple CSV files with the same schema that contain duplicate rows.
- The row-level merge should have logic to pick a specific value for each column based on the value itself, e.g. the phone taken from file1 and the address taken from file2 (see the groupby sketch below).
- A combination of one or more column values defines whether rows are duplicates.
Can this be achieved using pandas?
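Something like the per-column rule table below is what I'm hoping pandas can express. The rules here are placeholders (and I haven't handled missing values):

```python
import pandas as pd

df = pd.concat(
    [pd.read_csv(f) for f in ["file1.csv", "file2.csv"]],
    ignore_index=True,
)

# Group on the composite key; every non-key column gets its own rule.
merged = df.groupby(["store_id", "phone"], as_index=False).agg(
    address=("address", lambda s: max(s, key=len)),  # most descriptive wins
    # any other non-key column could carry its own rule, e.g.
    # (hypothetical column) manager=("manager", "last"),
)
merged = merged[["store_id", "address", "phone"]]  # restore column order
```

The appeal of groupby/agg to me is that each non-key column carries its own resolution rule independently, so adding columns later shouldn't disturb the key logic.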