How to join two dataframe and keep certain columns of each?

Question

I have two dataframes, names and claims:

names = pd.DataFrame({
    'UniqueID': 'A B C D E F'.split(),
    'Name':['Susie', 'George Foreman', 'Charles', 'Nicole', 'Peter Piper', 'Penelope Cruz'],
    'Address':['111 3rd St', '123 Bank St', '555 Square Sq', '9 Charlton Ave', 'PO Box 1', 'The White House'], 
    'Phone number':['2032218686', '2032032203', '8048048804', '2232645879', '2564544469', '8005865555']})

  UniqueID            Name          Address Phone number
0        A           Susie       111 3rd St   2032218686
1        B  George Foreman      123 Bank St   2032032203
2        C         Charles    555 Square Sq   8048048804
3        D          Nicole   9 Charlton Ave   2232645879
4        E     Peter Piper         PO Box 1   2564544469
5        F   Penelope Cruz  The White House   8005865555


claims = pd.DataFrame({
    'ClaimNo':range(29,38),
    'ClaimDetails':['Slip and fall','Clmt slipped and fell','Thunderstorms are scary','Hail storm damage',
                   'Property fire','Arson','Shooting','Shooting and fatality','Slip and fall'],
    'PolicyNo':['00058566-0','00056455-5','00058588-8','00011111-2','00088787-0','00045658-0','00012345-6','00065432-1','00088080-4'],
    'UniqueID':'A F F D E A D E E'.split()})

   ClaimNo             ClaimDetails    PolicyNo UniqueID
0       29            Slip and fall  00058566-0        A
1       30    Clmt slipped and fell  00056455-5        F
2       31  Thunderstorms are scary  00058588-8        F
3       32        Hail storm damage  00011111-2        D
4       33            Property fire  00088787-0        E
5       34                    Arson  00045658-0        A
6       35                 Shooting  00012345-6        D
7       36    Shooting and fatality  00065432-1        E
8       37            Slip and fall  00088080-4        E

I want to create a new DataFrame containing only the rows of names for which the UniqueID appears in claims. I'm not sure if they should be merged or filtered.. I've been trying different types of merges but I can't seem to get the result I want, which should look like this:

  UniqueID           Name          Address Phone number
0        A          Susie       111 3rd St   2032218686
1        D         Nicole   9 Charlton Ave   2232645879
2        E    Peter Piper         PO Box 1   2564544469
3        F  Penelope Cruz  The White House   8005865555

Beauregard D · Accepted Answer · 2019-05-10T16:39:06.203

This seemed like the simplest way for me:

names[names.UniqueID.isin(claims['UniqueID'].to_numpy())]

edit: for other people who are answering, here are some helper dictionary/dataframe variables I used in order to answer OP's question:

data1 = {"UniqueID": {"0": "A", "1": "B", "2": "C", "3": "D", "4": "E", "5": "F"}, "Name": {"0": "Susie", "1": "George Foreman", "2": "Charles", "3": "Nicole", "4": "Peter Piper", "5": "Penelope Cruz"}, "Address": {"0": "111 3rd St", "1": "123 Bank St", "2": "555 Square Sq", "3": "9 Charlton Ave", "4": "PO Box 1", "5": "The White House"}, "Phone number": {"0": 2032218686, "1": 2032032203, "2": 8048048804, "3": 2232645879, "4": 2564544469, "5": 8005865555}}
names = pd.DataFrame.from_dict(data1)

data2 = {"ClaimNo": {"0": 29, "1": 30, "2": 31, "3": 32, "4": 33, "5": 34, "6": 35, "7": 36, "8": 37}, "ClaimDetails": {"0": "Slip and fall", "1": "Clmt slipped and fell", "2": "Thunderstorms are scary", "3": "Hail storm damage", "4": "Property fire", "5": "Arson", "6": "Shooting", "7": "Shooting and fatality", "8": "Slip and fall"}, "PolicyNo": {"0": "00058566-0", "1": "00056455-5", "2": "00058588-8", "3": "00011111-2", "4": "00088787-0", "5": "00045658-0", "6": "00012345-6", "7": "00065432-1", "8": "00088080-4"}, "UniqueID": {"0": "A", "1": "F", "2": "F", "3": "D", "4": "E", "5": "A", "6": "D", "7": "E", "8": "E"}}
claims = pd.DataFrame.from_dict(data2)

OP: it would be helpful if you provided these variables next time, I had to use pd.read_fwf to read your fixed-width format tables into dictionaries objects

Thanks, I ended up using this without `.to_numpy()` and it worked perfectly. — Nicole Suter, May 10 '19 at 17:11

score 0 · Answer 2 · answered May 10 '19 at 16:11

0

Does this not work?

print (pd.merge(names, claims, on='UniqueID'))

then maybe you can remove the columns that you don't need

data = data.drop(columns="some_column_name")

answered May 10 '19 at 16:11

ウィエム

405
4
12

This doesn't really work, the first command results in a dataframe with duplicates in the columns from `names` (9 rows instead of my desired 4). – Nicole Suter May 10 '19 at 17:14

cameravinga · Answer 3 · 2019-05-10T16:19:14.123

0

You can use the merge method. Just make sure that the UniqueID column in both dataframes are of the same dtype (in this case most likely 'str').

new_df = df1.merge(df2, how='inner' ,on='UniqueID')

If this does not work, as mentioned above it is because your columns are of different dtypes. They may also have extra whitespace. In order to change both of those things, you can do:

df1['UniqueID'] = df1['UniqueID'].astype(str).str.replace(" ","")
df2['UniqueID'] = df2['UniqueID'].astype(str).str.replace(" ","")

And then you can drop the columns that you don't need:

new_df = new_df.drop(columns=['ClaimDetails','PolicyNo'])

edited May 10 '19 at 16:19

answered May 10 '19 at 16:13

cameravinga

43
9

Like @ウィエム 's comment, this gives me duplicates. Also, the UniqueID column in both dataframes is a string with no extra whitespace. – Nicole Suter May 10 '19 at 17:16

How to join two dataframe and keep certain columns of each?

3 Answers3