Combining three datasets removing duplicates

Question

I've three datasets:

dataset 1

Customer1            Customer2          Exposures  + other columns
Nick McKenzie       Christopher Mill      23450
Nick McKenzie       Stephen Green         23450
Johnny Craston      Mary Shane              12
Johnny Craston      Stephen Green           12
Molly John          Casey Step          1000021

dataset2 (unique Customers: Customer 1 + Customer 2)

Customer            Age
Nick McKenzie       53
Johnny Craston      75
Molly John          34
Christopher Mill    63
Stephen Green       65
Mary Shane          54
Casey Step          34
Mick Sale

dataset 3

Customer1            Customer2          Exposures  + other columns
Mick Sale           Johnny Craston           
Mick Sale           Stephen Green

Exposures refers to Customer 1 only.

There are other columns omitted for brevity. Dataset 2 is built by getting unique customer 1 and unique customer 2: no duplicates are in that dataset. Dataset 3 has the same column of dataset 1.

I'd like to add the information from dataset 1 into dataset 2 to have

Final dataset

Customer            Age    Exposures   +   other columns
Nick McKenzie       53       23450
Johnny Craston      75       12
Molly John          34      1000021
Christopher Mill    63
Stephen Green       65
Mary Shane          54
Casey Step          34
Mick Sale

The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates. I have tried to combine them as follows

result = pd.concat([df2,df1,df3], axis=1)

but the result is not that one I'd expect. Something wrong is in my way of concatenating the datasets and I'd appreciate it if you can let me know what is wrong.

unfortunately the question was closed without any working solution. — LdM, Sep 15 '21 at 00:01

user2736738 · Accepted Answer · 2021-09-15T00:10:44.430

1

After concatenating the dataframe df1 and df2 (assuming they have same columns), we can remove the duplicates using df1.drop_duplicates(subset=['customer1']) and then we can join with df2 like this

df1.set_index('Customer1').join(df2.set_index('Customer'))

In case df1 and df2 has different columns based on the primary key we can join using the above command and then again join with the age table.

This would give the result. You can concatenate dataset 1 and datatset 3 because they have same columns. And then run this operation to get the desired result. I am joining specifying the respective keys.

Note: Though not related to the question but for the concatenation one can use this code pd.concat([df1, df3],ignore_index=True) (Here we are ignoring the index column)

edited Sep 15 '21 at 00:10

answered Sep 14 '21 at 23:27

user2736738

30,591
5
42
56

I am getting all NaN in the fields from df1. Do you know why? – LdM Sep 14 '21 at 23:32
But you didn't give us a minimal example that I could have run. The idea is correct if my sleep is not messing with me. If you create the dataframe creation code (hardcoded) and paste it here I can give it a try. – user2736738 Sep 14 '21 at 23:34
@LdM May be. Usually that's why community members request the users to post an example. It helps in long run. All the best. – user2736738 Sep 14 '21 at 23:39
@LdM I didn't because it was the same column in both of the datasets. So if you combine df1 and df2 (normal merging) and then you run the above code it would give the result. (unless I am missing something) – user2736738 Sep 14 '21 at 23:41
so first should I do `pd.concat([df1, df3], axis=1)` , right? – LdM Sep 14 '21 at 23:42
Yes that's what I meant. – user2736738 Sep 14 '21 at 23:43
@LdM Ideally that should be the case. I thought that's the thing you are looking for. In order to combine them you can further apply group. From your question I understood you wanted to get the age of each customer1 in these two datasets. That's what it does. – user2736738 Sep 14 '21 at 23:47
What is the primary key in df1 and df3? And in case of duplicate you remove based on just the primary key? – user2736738 Sep 14 '21 at 23:52
@LdM So it is "customer1" in both and in removal of duplicates we should keep the last one right? – user2736738 Sep 14 '21 at 23:54
yes, exactly. It does not matter if the first one or last one since I would be interested in the other columns, not in Customer2 – LdM Sep 14 '21 at 23:56
@LdM I have updated the answer and I think it gives the key things needed to know to solve the problem. If you have given example, I could have checked. SInce this is not the case I would have to stop here. – user2736738 Sep 15 '21 at 00:02
ok. thanks. df1 and df3 have the same columns, btw. df2 has only two columns, Customer and Age. – LdM Sep 15 '21 at 00:03
1

@LdM Yes then you can easily do it by merging and then remove duplicates on the Primary Key (`customer1`) and then join them to get the `age` column. And that would be it. – user2736738 Sep 15 '21 at 00:04
the problem was due to a missing ignore index. This should work: `pd.concat([df1, df3],ignore_index=True)`. If you could include it in your answer, I will more than happy to mark it. Also, we could delete comments to keep everything tidier. Thanks – LdM Sep 15 '21 at 00:08

Combining three datasets removing duplicates

1 Answers1