
I have two Pandas dataframes and want to intersect them (aka inner join) on a key. I understand how a Pandas dataframe `merge()` works, but the problem is that I want to keep the columns from only one dataframe, not from both.

For example, suppose I have the following two dataframes, df_users and df_valid_users.

df_users                     df_valid_users

user_id  email  dob          user_id last_login 
-------  -----  ---          ------- ----------
123      ...    ...          456     ...
456      ...    ...          710     ...
789      ...    ...

Desired result:

user_id  email  dob
-------  -----  ---
456      ...    ...

I want to get all the columns from df_users as long as df_users.user_id is also in df_valid_users.

I tried to use `pandas.merge()`, but the result contains the columns from both dataframes. How do I keep only the columns from the left dataframe? My real data has dozens of columns, not just a few as in this example.

import pandas as pd

rows_users = [ ['123', 'foo', '1970-01-01'],
               ['456', 'bar', '1980-01-01'],
               ['789', 'baz', '1990-01-01']]
df_users = pd.DataFrame(rows_users, columns=['user_id', 'email', 'dob'])
df_users
#   user_id email         dob
# 0     123   foo  1970-01-01
# 1     456   bar  1980-01-01
# 2     789   baz  1990-01-01

rows_valid_users = [ ['456', '2018-01-10'],
                     ['710', '1918-01-02']]
df_valid_users = pd.DataFrame(rows_valid_users, columns=['user_id', 'last_login'])
df_valid_users
#   user_id  last_login
# 0     456  2018-01-10
# 1     710  1918-01-02


pd.merge(df_users, df_valid_users, how='inner', on=['user_id'])
#   user_id email         dob  last_login
# 0     456   bar  1980-01-01  2018-01-10

Any help would be appreciated.

stackoverflowuser2010

1 Answer


I think you need `isin` :-)

df_users.loc[df_users['user_id'].isin(df_valid_users['user_id']),:]
Out[37]: 
  user_id email         dob
1     456   bar  1980-01-01
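
As an alternative, merging on just the key column of `df_valid_users` also keeps only the left dataframe's columns; a minimal sketch, assuming the `df_users` and `df_valid_users` from the question (note that, unlike the `isin` mask, a merge would duplicate rows if a `user_id` appeared more than once in `df_valid_users`):

# df_users and df_valid_users as built in the question
df_users.merge(df_valid_users[['user_id']], on='user_id', how='inner')
#   user_id email         dob
# 0     456   bar  1980-01-01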
BENY
    Thank you. I'm guessing the `isin()` function must run on all rows of the left dataframe. Is that as efficient as a call to `merge()`? Usually an inner-join is implemented efficiently (at least it is in SQL). – stackoverflowuser2010 Jan 18 '18 at 00:59
  • @stackoverflowuser2010 based on my experience it should be fast; you can test both methods on your side and check the time :-) (a small timing sketch follows this thread) – BENY Jan 18 '18 at 01:12
  • @YOBEN_S - How can we use the `loc` function for a combination of two fields, e.g. `df_users.loc[df_users['email','dob'].isin(df_valid_users['email','dob']),:]`? Tried it, but it raises a KeyError. – Love Putin Not War Jun 06 '20 at 15:54
  • @user12379095 https://stackoverflow.com/questions/48647534/python-pandas-find-difference-between-two-data-frames/48647840#48647840 this may give you some idea ~ Method 1 and Method 2 – BENY Jun 06 '20 at 15:57
  • @YOBEN_S - Thanks for replying. With **Method 1** all the rows of the first DF are output in the "merged" DF. With **Method 2** something funny is happening: the row that should not be there also appears. Basically I am trying to compare two DFs on a combination of two fields, so as to output only the columns of one DF (preferably the "left"), and not both as I am getting now with `.merge`. – Love Putin Not War Jun 06 '20 at 17:08
  • 2
    @user12379095 df_users.loc[df_users[['email','dob']].apply(tuple,1).isin(df_valid_users[['email','dob']].apply(tuple,1)),:] – BENY Jun 06 '20 at 17:14
  • @YOBEN_S - I don't know what to say ... It worked!! **Holy c...**. You're the savior!! Would send you a million $. – Love Putin Not War Jun 06 '20 at 17:25
  • @BENY: Hi! While searching SO I stumbled on this great solution and it works! I am, however, facing a problem. Here, the field "field2" is a decimal number (pls see my code). While executing the method on two DFs (df1 and df2), values like 4560.00 are caught but the ones like 4560.65 are not. Why? My code goes like this (which is exactly as you suggested, only the 2nd field is a decimal number here): `join_df = df1.loc[df1[['field1', 'field2']].apply(tuple,1).isin(df2[['field1', 'field2']].apply(tuple,1)),:]`. How do I make identical decimal numbers also be detected using the same method? Thnx. – carla Nov 05 '22 at 15:01
  • @carla floats are always the problem; try converting back to str just for the match ~ – BENY Nov 05 '22 at 15:30
  • @BENY Yeah tried to change the values to str using `.astype(str)`. Now values such as **4560.65** are getting detected, but the ones with **4560.00** are **not**. – carla Nov 05 '22 at 17:09
  • @BENY Finally got the result by changing the field to `astype(float)`. And it works... – carla Nov 05 '22 at 17:43
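
Following up on the timing question earlier in the comments, here is a minimal sketch for measuring both approaches yourself, assuming the `df_users` and `df_valid_users` from the question (these toy frames are far too small for meaningful numbers, so run it on your real data):

import timeit
import pandas as pd

# df_users and df_valid_users as built in the question
isin_time = timeit.timeit(
    lambda: df_users.loc[df_users['user_id'].isin(df_valid_users['user_id'])],
    number=1000)
merge_time = timeit.timeit(
    lambda: pd.merge(df_users, df_valid_users, how='inner', on=['user_id']),
    number=1000)
print('isin mask:', isin_time, 'merge:', merge_time)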
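
And a runnable version of the multi-column match discussed in the later comments; the frames and column names here (`df1`, `df2`, `field1`, `field2`) are made up for illustration, and rounding the float column on both sides is just one way to keep the match from tripping over float precision:

import pandas as pd

# hypothetical frames; only the (field1, field2) pairs matter for the match
df1 = pd.DataFrame({'field1': ['a', 'b', 'c'],
                    'field2': [4560.00, 4560.65, 12.30]})
df2 = pd.DataFrame({'field1': ['a', 'b'],
                    'field2': [4560.00, 4560.65]})

# build hashable (field1, field2) keys, rounding the float column the same way on both sides
left_keys = df1[['field1', 'field2']].round({'field2': 2}).apply(tuple, axis=1)
right_keys = df2[['field1', 'field2']].round({'field2': 2}).apply(tuple, axis=1)

# keep only the rows of df1 whose key pair also appears in df2; the columns stay df1's
join_df = df1.loc[left_keys.isin(right_keys)]
# join_df now holds the rows of df1 whose (field1, field2) pair also appears in df2 (here rows 0 and 1)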