0

I have a simple data frame with first and last names. I would like to get the equivalent of SQL SELF JOIN equivalent in pandas.

Here goes the full example:

import numpy as np
import pandas as pd


df = pd.DataFrame({'first_name': ['Rose','Summer','Jane','Kim','Jack'],
                   'last_name': ['Howard','Solstice','Kim','Cruz','Rose'],
                  'customer_id': [1,2,3,4,5]})
df
  first_name last_name  customer_id
0       Rose    Howard            1
1     Summer  Solstice            2
2       Jane       Kim            3
3        Kim      Cruz            4
4       Jack      Rose            5

REQUIRED OUTPUT

customer_id first_name  last_name   customer_id_1   first_name_1    last_name_1
1   Rose    Howard  5   Jack    Rose
4   Kim Cruz    3   Jane    Kim

Using SQL

select a.first_name, a.last_name, b.first_name, b.last_name
from df as a, df as b
where a.first_name = b.last_name

My attempt

(pd.concat( [ df[['first_name','last_name']],
             df[['first_name','last_name']].add_suffix('_1')
           ], axis=1, ignore_index=False)
)
  first_name last_name first_name_1 last_name_1
0       Rose    Howard         Rose      Howard
1     Summer  Solstice       Summer    Solstice
2       Jane      Rose         Jane        Rose

But,

(pd.concat( [ df,df.add_suffix('_1')], axis=1)
 .query(" first_name == last_name_1 ")
)
This gives empty output to my surprise!!

I want two rows and fours columns as given by SQL.

BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169

1 Answers1

0

Use left_on and right_on

df.merge(df, left_on='first_name', right_on='last_name')

Result:

  first_name_x last_name_x  customer_id_x first_name_y last_name_y  \
0         Rose      Howard              1         Jack        Rose   
1          Kim        Cruz              4         Jane         Kim   

   customer_id_y  
0              5  
1              3
BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169