pandas selected columns from second dataframe where another column's values exist in a primary dataframe

Question

I'm struggling with quite a specific issue. I have two pandas dataframes of different lengths with different indexes. For each item contained in df1, I want to look into df2 and take a couple of columns (not contained in df1), where the values of one of the df2 columns are equal to those in df1. Example:

import pandas as pd

data_1 = {'TARGET_NAME':['fishinghook', 'doorlock', 'penguin', 'ashtray', 'cat', 'elephant', 'cupcake', 'exercisebench'],
          'FOOBAR':['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
          'ix':[320, 321, 322, 323, 324, 325, 326, 328]}

data_2 = {'IMAGE_NAME':['cat', 'penguin', 'jewelrybox', 'exercisebench', 'doorlock', 'jar', ],
          'VALUES_1':['h', 'h', 'c', 'm', 'h', 'f'],
          'VALUES_2':['hm', 'hl', 'cm', 'ml', 'hh', 'fl'],
          'ix':[616, 617, 618, 619, 620, 621]}

desired = {'TARGET_NAME':['fishinghook', 'doorlock', 'penguin', 'ashtray', 'cat', 'elephant', 'cupcake', 'exercisebench'],
          'FOOBAR':['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
          'PRODUCED_VALUES_1':['DROPPED', 'h', 'h', 'DROPPED', 'h', 'DROPPED', 'DROPPED', 'm'],
          'ix':[320, 321, 322, 323, 324, 325, 326, 328]}

df1 = pd.DataFrame(data_1, index=data_1['ix'])
df2 = pd.DataFrame(data_2, index=data_2['ix'])
desired_df = pd.DataFrame(desired, index=desired['ix'])

df1
Out[2]: 
    FOOBAR    TARGET_NAME   ix
320    foo    fishinghook  320
321    bar       doorlock  321
322    foo        penguin  322
323    bar        ashtray  323
324    foo            cat  324
325    bar       elephant  325
326    foo        cupcake  326
328    bar  exercisebench  328

df2
Out[3]: 
        IMAGE_NAME VALUES_1 VALUES_2   ix
616            cat        h       hm  616
617        penguin        h       hl  617
618     jewelrybox        c       cm  618
619  exercisebench        m       ml  619
620       doorlock        h       hh  620
621            jar        f       fl  621

desired_df
Out[4]: 
    FOOBAR PRODUCED_VALUES_1    TARGET_NAME   ix
320    foo           DROPPED    fishinghook  320
321    bar                 h       doorlock  321
322    foo                 h        penguin  322
323    bar           DROPPED        ashtray  323
324    foo                 h            cat  324
325    bar           DROPPED       elephant  325
326    foo           DROPPED        cupcake  326
328    bar                 m  exercisebench  328

I want to look at each value in df1['TARGET_NAME'] and, where it equals df2['IMAGE_NAME'], take the VALUES_1 & VALUES_2 columns from df2 and add those details to df1 (or a copy of df1). If it doesn't match anywhere in df2 (because positions are all different as well) then I want it to write something else (e.g. 'DROPPED'). Ideally I want the df1 indexes to remain the same.

Any help appreciated!

score 4 · Accepted Answer · answered Sep 16 '17 at 07:49

You can outer merge the data by renaming the column and then rename the columns with your desired column names and then fill the nan of produced_values with dropped and drop the nans. At last set the df1 index.

ndf = df1.merge(df2.rename(columns = {'IMAGE_NAME':'TARGET_NAME'}),how='outer',on='TARGET_NAME')
ndf = ndf.drop(['ix_y','VALUES_2'],1).rename(columns={'ix_x':'ix','VALUES_1':'PRODUCED_VALUES_1'})

ndf['PRODUCED_VALUES_1'] = ndf['PRODUCED_VALUES_1'].fillna('Dropped')
ndf = ndf.dropna().set_index(df1.index)

    FOOBAR    TARGET_NAME     ix PRODUCED_VALUES_1
320    foo    fishinghook  320.0           Dropped
321    bar       doorlock  321.0                 h
322    foo        penguin  322.0                 h
323    bar        ashtray  323.0           Dropped
324    foo            cat  324.0                 h
325    bar       elephant  325.0           Dropped
326    foo        cupcake  326.0           Dropped
328    bar  exercisebench  328.0                 m

Hey! Anything to add [here](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care/54028200#54028200)? :) — cs95, Jan 04 '19 at 05:17

MaxU - stand with Ukraine · Answer 2 · 2017-09-16T09:38:47.917

In [34]: df1['PRODUCED_VALUES_1'] = \
             df1['TARGET_NAME'].map(df2.set_index('IMAGE_NAME')['VALUES_1']) \
                               .fillna('DROPPED')

In [35]: df1
Out[35]:
    FOOBAR    TARGET_NAME   ix PRODUCED_VALUES_1
320    foo    fishinghook  320           DROPPED
321    bar       doorlock  321                 h
322    foo        penguin  322                 h
323    bar        ashtray  323           DROPPED
324    foo            cat  324                 h
325    bar       elephant  325           DROPPED
326    foo        cupcake  326           DROPPED
328    bar  exercisebench  328                 m

or one-liner which is similar to @Bharath shetty's solution:

In [26]: df1.merge(df2[['IMAGE_NAME','VALUES_1']].rename(columns={'IMAGE_NAME':'TARGET_NAME'}),
    ...:           how='left') \
    ...:    .fillna('DROPPED') \
    ...:    .rename(columns=lambda c: 'PRODUCED_' + c if c=='VALUES_1' else c) \
    ...:    .set_index(df1.index)
    ...:
Out[26]:
    FOOBAR    TARGET_NAME   ix PRODUCED_VALUES_1
320    foo    fishinghook  320           DROPPED
321    bar       doorlock  321                 h
322    foo        penguin  322                 h
323    bar        ashtray  323           DROPPED
324    foo            cat  324                 h
325    bar       elephant  325           DROPPED
326    foo        cupcake  326           DROPPED
328    bar  exercisebench  328                 m

My first approach was map but couldnt get the solution so went to merge. This is really nice — Bharath M Shetty, Sep 16 '17 at 09:37

pandas selected columns from second dataframe where another column's values exist in a primary dataframe

2 Answers2