I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3],"col1":[1,1.5,2,2.5,3,3.5],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002"], "col3":[11,12,13,14,15,16],"col4":[21,22,23,24,25,26]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date1 col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So, the output of this output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a Key Error.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.