
I have a huge dataset and I want to merge two dataframes on a specific column, keeping only the rows whose values from df1 are not present in df2 and vice versa.

Example:

df1:

Domain        Sales
google.com    100
facebook.com  200
youtube.com   300

df2:

Domain       Sales
google.com   100
yahoo.com    200
youtube.com  300

Required output:

Domain          Sales
facebook.com    200
yahoo.com       200

I have tried:

df = pd.merge(df1, df2, on="Domain", how="outer")

and every other value of the how parameter, but none of them gives the required output. How can I achieve it?

Vishwa Mittar

3 Answers

pd.concat([df1, df2]).drop_duplicates('Domain', keep=False)
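A minimal self-contained run on the example data (the dataframe definitions are my addition, reconstructed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"Domain": ["google.com", "facebook.com", "youtube.com"],
                    "Sales": [100, 200, 300]})
df2 = pd.DataFrame({"Domain": ["google.com", "yahoo.com", "youtube.com"],
                    "Sales": [100, 200, 300]})

# keep=False drops every row whose Domain appears more than once,
# leaving only the domains unique to one of the two frames
out = pd.concat([df1, df2]).drop_duplicates("Domain", keep=False)
```

Here keep=False is the key: instead of keeping the first or last duplicate, it discards all of them, which turns drop_duplicates into a symmetric-difference filter.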
Panda Kim
  • NB. If the two dataframes are huge as indicated in the question, this will create a huge intermediate – mozway Nov 07 '22 at 14:27
  • @mozway it worked fine for me, but I am concerned about "huge intermediate", what does it mean? – Vishwa Mittar Nov 07 '22 at 14:34
  • @VishwaMittar this means that this approach will create a dataframe combining all data and **then** remove the duplicates. Opposite to an approach of combining only the unique rows. You'll have to test with your data what is the most efficient (in terms of speed and memory) – mozway Nov 07 '22 at 14:41
  • @mozway what could be the solution for only combining unique values? – Vishwa Mittar Nov 07 '22 at 14:45
  • @mozway I checked the timing with 1.5 million rows: this answer takes 0.31984 sec vs. 0.40953 sec for yours – Panda Kim Nov 07 '22 at 14:48
  • I provided one below, but test on your real data, maybe this one is efficient enough – mozway Nov 07 '22 at 14:48
  • @PandaKim the intermediate issue is not a problem of speed, but of **memory**. Typically, if you have 2 Dataframes that take up more than half the available RAM in total, concatenating first will exhaust the RAM. – mozway Nov 12 '22 at 06:49

As you have a huge dataset, you can use set operations to pre-filter the data before the concat:

S = set(df1['Domain']).symmetric_difference(df2['Domain'])
out = pd.concat([d.loc[lambda d: d['Domain'].isin(S)] for d in [df1, df2]])

output:

         Domain  Sales
1  facebook.com    200
1     yahoo.com    200
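For reference, a self-contained version of the two lines above on the example frames (the dataframe definitions are my addition):

```python
import pandas as pd

df1 = pd.DataFrame({"Domain": ["google.com", "facebook.com", "youtube.com"],
                    "Sales": [100, 200, 300]})
df2 = pd.DataFrame({"Domain": ["google.com", "yahoo.com", "youtube.com"],
                    "Sales": [100, 200, 300]})

# domains present in exactly one of the two frames
S = set(df1["Domain"]).symmetric_difference(df2["Domain"])

# filter each frame down to those domains *before* concatenating,
# so the intermediate frame only ever holds the rows we keep
out = pd.concat([d.loc[lambda d: d["Domain"].isin(S)] for d in [df1, df2]])
```

Because each frame is filtered before pd.concat, the intermediate never grows beyond the final result, which addresses the memory concern raised in the comments on the accepted answer.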
mozway

We can still do this with merge, using the indicator parameter to tag which side each row came from:

df = df1.merge(df2,on = 'Domain', indicator = True, how='outer').query('_merge!="both"')
df['Sales'] = df.pop('Sales_x').fillna(df.pop('Sales_y'))
df
Out[342]: 
         Domain      _merge  Sales
1  facebook.com   left_only  200.0
3     yahoo.com  right_only  200.0
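As a small follow-up (my addition, not part of the answer above): the helper _merge column can be dropped and Sales cast back to int, since the NaN values introduced by the outer merge forced it to float:

```python
import pandas as pd

df1 = pd.DataFrame({"Domain": ["google.com", "facebook.com", "youtube.com"],
                    "Sales": [100, 200, 300]})
df2 = pd.DataFrame({"Domain": ["google.com", "yahoo.com", "youtube.com"],
                    "Sales": [100, 200, 300]})

# outer merge with indicator=True adds a _merge column marking
# each row as left_only / right_only / both; drop the "both" rows
df = (df1.merge(df2, on="Domain", indicator=True, how="outer")
         .query('_merge != "both"'))

# coalesce the two Sales columns into one
df["Sales"] = df.pop("Sales_x").fillna(df.pop("Sales_y"))

# drop the helper column and restore the integer dtype
out = df.drop(columns="_merge").astype({"Sales": int})
```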
BENY