0

I have two API queries that return approximately 400 columns of data each by about 10,0000 rows of data deep. The columns are of all sorts of different types and in different order. The data is not always the same. I have a mapping file that gives me a column-to-column mapping. There’s one id column in both data sets that’s supposed to be common to both data sets.

I need to merge the two data sets, align the mapped columns side by side, and add a difference.

Looking for some guidance on whether there's a built-in library that handles something like this?

Here’s a simplified version of the problem for illustration:

Df1

Id1 Data1 Data2 Data3 Data4
A 1 1 1 1
B 2 2 2 2
C 3 3 3 3

Df2

Id2 Data4 Data1 Data2 Data3
A 1 1 1 1
B 2 2 2 2
C 3 3 3 3

Mapping

Df2-data Df1-data
Data1 Data1
Data2 Data2
Data3 Data3
Data4 Data4
Id2 Id1

Df-out

Id1 Id2 Df1-Data1 Df2-Data1 Diff-Data1 Df1-Data2 Df2-Data2 Diff-Data2
A A 1 1 0 1 1 0
B B 2 2 0 2 2 0
C C 3 3 0 3 3 0
Digital Farmer
  • 1,705
  • 5
  • 17
  • 67
Monduras
  • 47
  • 8

1 Answers1

1

May be there is an easier way to do it, but this is what I come up with.

# Melt the two DFs (df1, df2) and merge them together on the ID, 
# then merge it with the mapping, resulting in the dataframe that has 
# relationship b/w the two DFs and the mapping

df3 = df1.melt('Id1',var_name='df1_data', value_name='df1').merge(
      df2.melt('Id2',var_name='df2_data', value_name='df2'), 
    left_on=['Id1'], 
    right_on=['Id2']).merge(
    dfmap,
      left_on=['df2_data','df1_data'], 
      right_on=['Df2-data','Df1-data'])

# calculate the different of the two values

df3['diff'] = df3['df1'] - df3['df2']

# the column of interest
cols = ['Id1', 'df1', 'Id2',  'df2', 'diff', 'df1_data'] 

#pivot to get the desired layout
df4=df3[cols].pivot(index=['Id1','Id2'], columns=['df1_data'])  

#merge column names
df4.columns = [cols[1]+'_'+cols[0] for cols in df4.columns]

#sort columns to match the desired sequence of columns
df4.reindex(sorted(df4.columns), axis=1)
         Data1_df1 Data1_df2 Data1_diff Data2_df1 Data2_df2 Data2_diff Data3_df1 Data3_df2 Data3_diff Data4_df1 Data4_df2 Data4_diff
Id1 Id2                                                 
A   A      1          1           0         1        1        0            1        1         0          1            1         0
B   B      2          2           0         2        2        0            2        2         0          2            2         0
C   C      3          3           0         3        3        0            3        3         0          3            3         0
Naveed
  • 11,495
  • 2
  • 14
  • 21
  • 1
    thanks! i think this works for the sample use case! I had to unpack your nested melts and merges to make sense of what was going on. I'll try and adapt this to the data I'm using. – Monduras Jun 20 '22 at 14:35