
I've been searching for help with a merging query, but I'm not sure of the most succinct way of phrasing my question, so I haven't been able to find anything that helps directly.

I'm looking to merge two tables on three columns using a left join. However, there are some null entries in the third field, and in that case I think I effectively want an outer join: if there is a match on the first two columns, the data should be joined regardless of whether there is a match in the third column.

df_A:
Competitor    Product    Type    
A             P1         X
A             P2         X
A             P2         Y
B             P1         X
B             P1         Y 

df_B:
Competitor    Product    Type    Value    
A             P1         X       £5
A             P2         X       £10
A             P2         Y       £12
B             P1                 £15

I want to do a merge on these two tables using the Competitor, Product and Type fields. However, the 'Type' field in the second table will not always be populated, and in this case I want the Value to be applied to all Types in table A, i.e.:

Competitor    Product    Type   Value   
A             P1         X      £5
A             P2         X      £10
A             P2         Y      £12
B             P1         X      £15
B             P1         Y      £15

I can successfully merge on the first two columns using the code:

df_merge = pd.merge(df_A, df_B, how='left', on=['Competitor','Product'])

but if I add the third column, 'Type', to the join keys, the Value is only populated for rows with matches in all three columns, i.e.:

Competitor    Product    Type   Value   
A             P1         X      £5
A             P2         X      £10
A             P2         Y      £12
B             P1         X
B             P1         Y
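
This was using:

df_merge = pd.merge(df_A, df_B, how='left', on=['Competitor','Product','Type'])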

Is there a way of combining the join types, or any other way of getting to this solution?

Sarah

1 Answer


You could merge on ['Competitor','Product'], using suffixes so that df_A's Type column keeps its name:

df_merged = pd.merge(df_A, df_B, how='left', on=['Competitor','Product'],
                     suffixes=('','_y'))

and then select the rows where either the Type is the same or Type_y holds the wildcard value:

mask = (df_merged['Type'] == df_merged['Type_y']) | (df_merged['Type_y'] == '')
result = df_merged.loc[mask, ['Competitor','Product','Type','Value']]

For example,

import pandas as pd

df_A = pd.DataFrame({'Competitor': ['A', 'A', 'A', 'B', 'B'],
                     'Product': ['P1', 'P2', 'P2', 'P1', 'P1'],
                     'Type': ['X', 'X', 'Y', 'X', 'Y']})

df_B = pd.DataFrame({'Competitor': ['A', 'A', 'A', 'B'],
                     'Product': ['P1', 'P2', 'P2', 'P1'],
                     'Type': ['X', 'X', 'Y', ''],
                     'Value': ['£5', '£10', '£12', '£15']})

df_merged = pd.merge(df_A, df_B, how='left', on=['Competitor','Product'], 
                     suffixes=('','_y'))
mask = (df_merged['Type'] == df_merged['Type_y']) | (df_merged['Type_y'] == '')
result = df_merged.loc[mask, ['Competitor','Product','Type','Value']]

print(result)

yields

  Competitor Product Type Value
0          A      P1    X    £5
1          A      P2    X   £10
4          A      P2    Y   £12
5          B      P1    X   £15
6          B      P1    Y   £15

One criticism that could be leveled at using

pd.merge(df_A, df_B, how='left', on=['Competitor','Product'])

is that it could generate a lot of unnecessary rows -- all those rows where Type does not equal Type_y. If df_A and df_B are big, this could lead to excessive use of memory.
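
As a rough illustration with the toy frames above, the two-key left merge already produces more rows than df_A before the mask filters them back down:

df_merged = pd.merge(df_A, df_B, how='left', on=['Competitor','Product'],
                     suffixes=('','_y'))
print(len(df_merged), len(df_A))    # 7 vs. 5 here; the gap grows with more duplicate keys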

To address this problem, we can be more parsimonious with memory by separating df_B into two parts: the rows with the wildcard values and those without:

is_wild = pd.isnull(df_B['Type'])
df_notwild, df_wild = df_B.loc[~is_wild], df_B.loc[is_wild]

then merge the two parts separately. When there are no wildcard values, we can inner-merge on all columns. When there are wildcards, we only want to right-merge on ['Competitor','Product']:

df_merged1 = pd.merge(df_A, df_notwild, how='inner')
df_merged2 = pd.merge(df_A, df_wild, how='right', on=['Competitor','Product'],
                      suffixes=('','_y')).drop('Type_y', axis=1)

Then the two DataFrames can be concatenated to form the desired result:

result = pd.concat([df_merged1, df_merged2], ignore_index=True)

Thus, to save memory,

import numpy as np
import pandas as pd

df_A = pd.DataFrame({'Competitor': ['A', 'A', 'A', 'B', 'B'],
                     'Product': ['P1', 'P2', 'P2', 'P1', 'P1'],
                     'Type': ['X', 'X', 'Y', 'X', 'Y']})

df_B = pd.DataFrame({'Competitor': ['A', 'A', 'A', 'B'],
                     'Product': ['P1', 'P2', 'P2', 'P1'],
                     'Type': ['X', 'X', 'Y', np.nan],
                     'Value': ['£5', '£10', '£12', '£15']})

is_wild = pd.isnull(df_B['Type'])
df_notwild, df_wild = df_B.loc[~is_wild], df_B.loc[is_wild]

df_merged1 = pd.merge(df_A, df_notwild, how='inner')
df_merged2 = pd.merge(df_A, df_wild, how='right', on=['Competitor','Product'],
                      suffixes=('','_y')).drop('Type_y', axis=1)

result = pd.concat([df_merged1, df_merged2], ignore_index=True)
print(result)

produces the same result as the first method, above.

unutbu
  • How would this be adapted for fields that are NaN? Adjusting the code to is_wild = df_B['Type'] == 'NaN' results in all entries of the 'is wild' table being False – Sarah Feb 01 '17 at 11:05
  • If the wildcard value is `NaN`, then use `is_wild = pd.isnull(df_B['Type'])`. I've updated the second example to demonstrate it. – unutbu Feb 01 '17 at 11:10
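
For completeness, if a table mixes empty strings and NaN as wildcard markers, the first method's mask can be extended to cover both. A minimal sketch, assuming the column names from the first example:

# Treat either an empty string or NaN in Type_y as a wildcard match
mask = ((df_merged['Type'] == df_merged['Type_y'])
        | (df_merged['Type_y'] == '')
        | df_merged['Type_y'].isnull())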