
A similar question was asked here (Pandas merge on multiple columns ignoring NaN) but went unanswered, so I'll ask again in case someone can help.

I need to merge values from df2 into df1, but the key used in the merge differs between rows of df2: each row of df2 has NaNs in different columns, and for that row I want to ignore those columns and match only on the columns that have values.

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df1    
    level1 level2 level3
0        0      1      3
1        0      2      4 
2        1      2      5 

df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
df2
    level1 level2 level3 value
0      0.0    NaN    NaN    10 
1      0.0    1.0    NaN    12 
2      NaN    2.0    5.0    13 

When I do df1.merge(df2, how='left'), I get df1 with NaN in the value column, since there is no match on all the level columns: pandas tries to match the NaN values as well.

What I want is to get a match for any row in df2 without trying to match the NaNs:

    level1 level2 level3 value
0        0      1      3    10 
1        0      1      3    12 
2        0      2      4    10 
3        1      2      5    13 

Explanation: Row 0 in df1 has a match on the non-NaN columns of rows 0 and 1 in df2, so it gets values 10 and 12 from there. Row 1 in df1 has a match on the non-NaN columns of row 0 in df2, so it gets value 10 from there. Row 2 in df1 has a match on the non-NaN columns of row 2 in df2, so it gets value 13 from there.

In the real data I actually have 6 level columns, and the non-NaN columns in each row of df2 can be any combination of them, or even a single column.

What I do now is iterrows over the rows in df2, create for each one a mini-dataframe of only the non-NaN columns, and merge df1 with it. But as we know, that's not really efficient, and I wonder if there is something better that can be done.
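For reference, a minimal sketch of that row-wise baseline (the variable names `keys`, `mini`, and `pieces` are my own, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]],
                   columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]],
                   columns=['level1', 'level2', 'level3', 'value'])

pieces = []
for _, row in df2.iterrows():
    # Keep only the level columns that actually have a value in this row.
    keys = [c for c in ['level1', 'level2', 'level3'] if pd.notna(row[c])]
    # One-row frame holding just the usable keys and the value to attach.
    mini = pd.DataFrame([row[keys + ['value']]])
    pieces.append(df1.merge(mini, on=keys))

result = pd.concat(pieces, ignore_index=True)
```

This produces the four expected rows, just in a different order; the per-row merge is what makes it slow on larger data.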

Aryerez

2 Answers


I think I figured out a vectorized solution.

  1. Fundamentally, the idea is that you merge df1 with df2 once per level column (three times here) and then concat the dataframes together into one.
  2. From there, you count how many columns are null and subtract from the number of levels. This tells you how many duplicates (or matches) are required in order to prevent the data from being dropped later.
  3. Then, you calculate how many matches or duplicates there actually are. If actual is the same as required, then that means the row is a match, and it gets kept in the dataframe.

It's not pretty, but to improve my answer you could create a merging function to cut some of the code. Most importantly, it should be highly performant compared to looping through every row. As a final note, for the duplicates_required helper column you will need to change the 3 to a 6, since you have 6 level columns in your actual dataset, and you will obviously need to repeat some of my merging code:

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])
df2 = df2.assign(duplicates_required = 3 - df2.isnull().sum(axis=1))
df = pd.concat([
    df1.merge(df2[['level1', 'value', 'duplicates_required']], on='level1'),
    df1.merge(df2[['level2', 'value', 'duplicates_required']], on='level2'),
    df1.merge(df2[['level3', 'value', 'duplicates_required']], on='level3')
])
cols = ['level1', 'level2', 'level3', 'value']
df['actual_duplicates'] = df.groupby(cols)['value'].transform('size')
df = (df[df['duplicates_required'].eq(df['actual_duplicates'])]
     .drop_duplicates(subset=cols)
     .drop(['duplicates_required', 'actual_duplicates'], axis=1)
     .reset_index(drop=True))
df
Out[1]: 
   level1  level2  level3  value
0       0       1       3     10
1       0       1       3     12
2       0       2       4     10
3       1       2       5     13
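The merging function suggested above could look something like this (my own sketch, not the answerer's code; the helper name `merge_on_each_level` is hypothetical):

```python
import pandas as pd

def merge_on_each_level(df1, df2, levels, value_col='value'):
    """Merge df1 with df2 once per level column and stack the results."""
    # How many single-level merges a df2 row must appear in to be a full match.
    helper = df2.assign(
        duplicates_required=len(levels) - df2[levels].isnull().sum(axis=1))
    pieces = [
        df1.merge(helper[[level, value_col, 'duplicates_required']], on=level)
        for level in levels
    ]
    return pd.concat(pieces)

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]],
                   columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]],
                   columns=['level1', 'level2', 'level3', 'value'])

stacked = merge_on_each_level(df1, df2, ['level1', 'level2', 'level3'])
```

The duplicate-counting and filtering steps from the answer then stay exactly the same; only the repeated merge lines are folded into the loop, so with 6 level columns you just pass a longer `levels` list.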
David Erickson
  • It seems to work. I'll dive into it later, to make sure it covers all my real data options. Will comment again if there is a problem. – Aryerez Aug 25 '21 at 07:39

I think this works better than my previous answer using regex. Similar process, but a bit simpler to understand.

  1. Do a full merge of the two dataframes
  2. Compare across levels and count number of mismatches
  3. Filter to rows where mismatch count == 0
import pandas as pd

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])

levels_to_match = ['level1','level2','level3']
levels_to_match_df2 = [level + '_df2' for level in levels_to_match]

for df in [df1,df2]:
    df['temp'] = 1
 
df1 = df1.merge(df2, on='temp', suffixes=[None,'_df2']).drop(columns=['temp'])

df1['mismatch'] = df1.apply(lambda x: 
    sum([(1 - (y == z or pd.isna(z))) for y, z in zip(list(x[levels_to_match]), list(x[levels_to_match_df2]))]), 
    axis=1)
    
df1 = df1.loc[df1['mismatch'] == 0, :].drop(columns=['mismatch'] + levels_to_match_df2)

print(df1)

   level1  level2  level3  value
0       0       1       3     10
1       0       1       3     12
3       0       2       4     10
8       1       2       5     13
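A side note from me, not part of the original answer: since pandas 1.2 the `temp` column trick can be replaced with `how='cross'`, which builds the same cartesian product directly:

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]],
                   columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]],
                   columns=['level1', 'level2', 'level3', 'value'])

# Every row of df1 paired with every row of df2 (3 x 3 = 9 rows);
# overlapping level columns from df2 get the '_df2' suffix.
crossed = df1.merge(df2, how='cross', suffixes=(None, '_df2'))
```

From there the mismatch-counting and filtering steps work unchanged on `crossed`.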

Old answer with regex

Probably not ideal, but maybe try converting your levels into strings and regex patterns, then do a full merge of all possible combinations, and finally filter with a regex search across two helper columns (one from df1 and the other from df2).

Assuming the data you're matching on are either int or NaN then this seems to work okay. If you have other data types in your real data then the string/regex transformations will need to be adjusted accordingly.

import pandas as pd
import re

df1 = pd.DataFrame([[0, 1, 3], [0, 2, 4], [1, 2, 5]], columns=['level1', 'level2', 'level3'])
df2 = pd.DataFrame([[0, None, None, 10], [0, 1, None, 12], [None, 2, 5, 13]], columns=['level1', 'level2', 'level3', 'value'])

levels_to_match = ['level1','level2','level3']

for df in [df1,df2]:
    df['helper'] = df[levels_to_match].apply(list, axis=1)
    df['helper'] = df['helper'].apply(lambda x: ','.join([str(int(item)) if pd.notna(item) else '.*' for item in x]))
    df['temp'] = 1

df1 = df1.merge(df2.drop(columns=levels_to_match), on='temp', suffixes=[None,'_df2']).drop(columns=['temp'])
df1['match'] = df1.apply(lambda x: re.search(x['helper_df2'], x['helper']) is not None, axis=1)

df1 = df1.loc[df1['match'], :].drop(columns=['helper','helper_df2','match'])

print(df1)

   level1  level2  level3  value
0       0       1       3     10
1       0       1       3     12
3       0       2       4     10
8       1       2       5     13
StevenS