How to approach a complicated dataframe merge

Question

I have two API queries that return approximately 400 columns of data each by about 10,0000 rows of data deep. The columns are of all sorts of different types and in different order. The data is not always the same. I have a mapping file that gives me a column-to-column mapping. There’s one id column in both data sets that’s supposed to be common to both data sets.

I need to merge the two data sets, align the mapped columns side by side, and add a difference.

Looking for some guidance on whether there's a built-in library that handles something like this?

Here’s a simplified version of the problem for illustration:

Df1

Id1	Data1	Data2	Data3	Data4
A	1	1	1	1
B	2	2	2	2
C	3	3	3	3

Df2

Id2	Data4	Data1	Data2	Data3
A	1	1	1	1
B	2	2	2	2
C	3	3	3	3

Mapping

Df2-data	Df1-data
Data1	Data1
Data2	Data2
Data3	Data3
Data4	Data4
Id2	Id1

Df-out

Id1	Id2	Df1-Data1	Df2-Data1	Df1-Data2	Df2-Data2
A	A	1	1	1	1
B	B	2	2	2	2
C	C	3	3	3	3

@ifly6 ty this is very useful for understanding merge, but how do i handle the re-ordering of the columns in the data frame based on the mapping? — Monduras, Jun 18 '22 at 01:48
Pass `indicator=True` and then iterate through your mapping to calculate and calculate the requisite diffs. — ifly6, Jun 18 '22 at 03:26

score 1 · Accepted Answer · answered Jun 18 '22 at 15:20

May be there is an easier way to do it, but this is what I come up with.

# Melt the two DFs (df1, df2) and merge them together on the ID, 
# then merge it with the mapping, resulting in the dataframe that has 
# relationship b/w the two DFs and the mapping

df3 = df1.melt('Id1',var_name='df1_data', value_name='df1').merge(
      df2.melt('Id2',var_name='df2_data', value_name='df2'), 
    left_on=['Id1'], 
    right_on=['Id2']).merge(
    dfmap,
      left_on=['df2_data','df1_data'], 
      right_on=['Df2-data','Df1-data'])

# calculate the different of the two values

df3['diff'] = df3['df1'] - df3['df2']

# the column of interest
cols = ['Id1', 'df1', 'Id2',  'df2', 'diff', 'df1_data'] 

#pivot to get the desired layout
df4=df3[cols].pivot(index=['Id1','Id2'], columns=['df1_data'])  

#merge column names
df4.columns = [cols[1]+'_'+cols[0] for cols in df4.columns]

#sort columns to match the desired sequence of columns
df4.reindex(sorted(df4.columns), axis=1)

         Data1_df1 Data1_df2 Data1_diff Data2_df1 Data2_df2 Data2_diff Data3_df1 Data3_df2 Data3_diff Data4_df1 Data4_df2 Data4_diff
Id1 Id2                                                 
A   A      1          1           0         1        1        0            1        1         0          1            1         0
B   B      2          2           0         2        2        0            2        2         0          2            2         0
C   C      3          3           0         3        3        0            3        3         0          3            3         0

thanks! i think this works for the sample use case! I had to unpack your nested melts and merges to make sense of what was going on. I'll try and adapt this to the data I'm using. — Monduras, Jun 20 '22 at 14:35

How to approach a complicated dataframe merge

1 Answers1