
I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.

import pandas as pd
df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "colA_1": [1, 2, 3],
                    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
                    "colA_2": [4, 5, 6],
                    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"]})
df2 = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3, 3],
                    "col1": [1, 1.5, 2, 2.5, 3, 3.5, 4],
                    "date": ["1.1.2001", "1.1.2002", "2.1.2001", "2.1.2002",
                             "3.1.2001", "3.1.2002", "4.1.2002"],
                    "col3": [11, 12, 13, 14, 15, 16, 17],
                    "col4": [21, 22, 23, 24, 25, 26, 27]})

df1 looks like:

   ID  colA_1     date1  colA_2     date2
0   1       1  1.1.2001       4  1.1.2002
1   2       2  2.1.2001       5  2.1.2002
2   3       3  3.1.2001       6  3.1.2002

df2 looks like:

   ID  col1      date  col3  col4
0   1   1.0  1.1.2001    11    21
1   1   1.5  1.1.2002    12    22
2   2   2.0  2.1.2001    13    23
3   2   2.5  2.1.2002    14    24
4   3   3.0  3.1.2001    15    25
5   3   3.5  3.1.2002    16    26
6   3   4.0  4.1.2002    17    27


I want to take a given column from df2, "col3", and then:

(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".

(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".

(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.

So the output dataframe, df3, should look like this:

   ID  colA_1     date1  colA_2     date2  colB_1  colB_2
0   1       1  1.1.2001       4  1.1.2002      11      12
1   2       2  2.1.2001       5  2.1.2002      13      14
2   3       3  3.1.2001       6  3.1.2002      15      16

What is the best way to do this?

I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify the column matching. I think df.mask might be able to help me, but I am not sure how to implement it.

e.g.: the following code

df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df3.mask(filter1, other=df2["col3"])
df3["colB_2"] = df3.mask(filter2, other=df2["col3"])

gives the error

ValueError: Can only compare identically-labeled Series objects
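As far as I can tell, the error is raised by the filter lines themselves, before mask is ever reached: df1 has 3 rows and df2 has more, so the element-wise == cannot line the two Series up. A minimal reproduction:

```python
import pandas as pd

# df1["ID"] has 3 rows, df2["ID"] has 6: pandas cannot align the
# two Series for an element-wise comparison, so it raises ValueError
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 1, 2, 2, 3, 3])
try:
    s1 == s2
except ValueError as e:
    print(e)  # Can only compare identically-labeled Series objects
```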

I asked this question previously, and it was closed as a duplicate of this one. However, it is not a duplicate. The answers in the linked question suggest either map or df.merge. map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching on multiple columns) does not work when the columns to be merged have different names in df1 and df2 ("date" and "date1", for example).

For example, the below code:

df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')

fails with a KeyError, since df2 has no column called "date1".
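From what I can tell, the KeyError could be sidestepped by renaming df2's columns before merging (or by using merge's left_on/right_on parameters), so the keys line up explicitly. A sketch of the rename approach with the example data:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "colA_1": [1, 2, 3],
                    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
                    "colA_2": [4, 5, 6],
                    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"]})
df2 = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3, 3],
                    "col1": [1, 1.5, 2, 2.5, 3, 3.5, 4],
                    "date": ["1.1.2001", "1.1.2002", "2.1.2001", "2.1.2002",
                             "3.1.2001", "3.1.2002", "4.1.2002"],
                    "col3": [11, 12, 13, 14, 15, 16, 17],
                    "col4": [21, 22, 23, 24, 25, 26, 27]})

# Rename df2's key/value columns to the names wanted in df3, then a plain
# left merge lines everything up; unmatched df2 rows (e.g. 4.1.2002) drop out.
lookup1 = df2[["ID", "date", "col3"]].rename(columns={"date": "date1", "col3": "colB_1"})
lookup2 = df2[["ID", "date", "col3"]].rename(columns={"date": "date2", "col3": "colB_2"})
df3 = df1.merge(lookup1, on=["ID", "date1"], how="left")
df3 = df3.merge(lookup2, on=["ID", "date2"], how="left")
print(df3)
```

But this names the matching columns per merge rather than stating all the conditions in one place, which is why I'm still looking for a more explicit formulation.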

Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.

Any help with this would be much appreciated.

1 Answer


You can use pd.wide_to_long after stripping the underscores; this will unpivot the dataframe, which you can then merge with df2 and pivot back using unstack:

# strip underscores so wide_to_long sees stubnames colA1, colA2, date1, date2
m = df1.rename(columns=lambda x: x.replace('_', ''))
unpiv = pd.wide_to_long(m, ['colA', 'date'], 'ID', 'v').reset_index()

# merge the unpivoted frame with df2 on (ID, date), then pivot back
merge_piv = (unpiv.merge(df2[['ID', 'date', 'col3']], on=['ID', 'date'], how='left')
                  .set_index(['ID', 'v'])['col3']
                  .unstack()
                  .add_prefix('colB_'))

final = df1.merge(merge_piv, left_on='ID', right_index=True)

   ID  colA_1     date1  colA_2     date2  colB_1  colB_2
0   1       1  1.1.2001       4  1.1.2002      11      12
1   2       2  2.1.2001       5  2.1.2002      13      14
2   3       3  3.1.2001       6  3.1.2002      15      16