17

I have two dataframes,

import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})

df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    'CC': ['121', '345'],
                    'DD': ['D0', 'D1']})

Now I need to get the rows where columns A and B of df1 match columns A and CC of df2. I tried the possible merge options, such as:

Both_DFs = pd.merge(df1, df2, how='left', left_on=['A', 'B'], right_on=['A', 'CC'])

But this does not give me the row data from df2, which is what I need: the result has all the column names from df2, but their values are just empty or NaN.

And then I tried:

Both_DFs=pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC'])[['A','B','CC']]

And this gives me the error:

KeyError: "['B'] not in index"

I am aiming for a merged DataFrame with all columns from both df1 and df2. Any suggestions would be great.

Desired output:

 Both_DFs
    A   B   C   BB  CC  DD
0   A1  121 K0  B0  121 D0

So in my dataframes (df1 and df2), only one row has an exact match on both columns of interest. That is, only one row of df1 matches df2 exactly on both keys: columns A and B in df1 against columns A and CC in df2.

vestland
ARJ
  • What does `print (df1.columns.tolist())` show? Is the problem only with the real data? – jezrael May 02 '17 at 10:10
  • It seems there is just some whitespace in the column names; to remove it, use `df.columns = df.columns.str.strip()` – jezrael May 02 '17 at 10:12
  • The actual dataframe has different column names; the df1 used in my question is a dummy. With my actual dataframe, (df1.columns.tolist()) prints ['Chr', 'Start', 'End', 'Annotation', 'Detailed Annotation', ' Description', ' Type'] – ARJ May 02 '17 at 10:13
  • @jezrael I already strip all columns and rows while reading the file with pd.read_csv. – ARJ May 02 '17 at 10:13
  • Super, still `KeyError`? Could the problem be in `print (df2.columns.tolist())`? – jezrael May 02 '17 at 10:14
  • @j For the second dataframe, print (df2.columns.tolist()) gives ['Chr', 'Start', 'End', 'chr', 'start', 'end', 'gene_sym', 'Lines'], which are the actual columns – ARJ May 02 '17 at 10:15
  • Yes, but maybe the problem is in the `df2.columns` names, because `['Chr', 'Start', 'End', 'Annotation', 'Detailed Annotation', ' Description', ' Type']` looks fine. – jezrael May 02 '17 at 10:16
  • How? I mean, I am wondering how that could cause a problem while merging – ARJ May 02 '17 at 10:18
  • I have no idea, because everything looks fine. :( – jezrael May 02 '17 at 10:19
  • The sample has `NaN` values because there is no match in the data. Try changing `df2` to `df2 = pd.DataFrame({'A': ['A2', 'A3'], 'BB': ['B0', 'B3'], 'CC': ['121', '345'], 'DD': ['D0', 'D1']})` – jezrael May 02 '17 at 10:28
  • Well, it didn't help :) – ARJ May 02 '17 at 10:52
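
The two checks discussed in these comments (stray whitespace, and keys that never match) can be sketched together. This is only an illustration under assumptions: the whitespace in the column name `' CC'` and in the value `'121 '` is invented here to mimic the suspected problem and does not come from the real data. `indicator=True` is a standard `pd.merge` option that adds a `_merge` column saying where each row was found.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})

# Hypothetical variant of df2 with stray whitespace in one column name
# and in one key value (assumption, for illustration only)
df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    ' CC': ['121 ', '345'],
                    'DD': ['D0', 'D1']})

# Strip whitespace from the column names, as suggested in the comments
df2.columns = df2.columns.str.strip()
# Strip whitespace from the key values too (assumes the column holds strings)
df2['CC'] = df2['CC'].str.strip()

# _merge is 'both' for matched rows and 'left_only' for rows without a partner
check = pd.merge(df1, df2, how='left',
                 left_on=['A', 'B'], right_on=['A', 'CC'],
                 indicator=True)
print(check[['A', 'B', 'CC', '_merge']])
```

Rows marked `left_only` found no partner in df2, and those are the rows whose df2 columns come out as NaN.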

3 Answers

14

Well, if you declare column A as index, it works:

Both_DFs = pd.merge(df1.set_index('A', drop=True),df2.set_index('A', drop=True), how='left',left_on=['B'],right_on=['CC'], left_index=True, right_index=True).dropna().reset_index()

This results in:

    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
2  A3  146  K1  B3  345  D1
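
On recent pandas versions the call above may fail: as far as I know, `pd.merge` now raises a `MergeError` when `left_on` is combined with `left_index=True`. A minimal sketch of the index-only variant follows, which is what the call effectively does according to the comments below. The name `by_index` is made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    'CC': ['121', '345'],
                    'DD': ['D0', 'D1']})

# Match on column A only, by using it as the index on both sides
by_index = pd.merge(df1.set_index('A'), df2.set_index('A'),
                    how='left', left_index=True, right_index=True)

# Drop the rows without a partner in df2 and turn A back into a column
by_index = by_index.dropna().reset_index()
print(by_index)
```

Note that this matches on A alone, so B and CC are not compared at all.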

EDIT

You just needed:

Both_DFs = pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC']).dropna()

Which gives:

    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
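
One caveat with a bare `dropna()`: it drops every row containing a NaN in any column, including NaN values that were already in df1 before the merge. A minimal sketch, assuming a hypothetical missing value in column C of df1 (not part of the question's data), that restricts the check to a column coming from df2:

```python
import numpy as np
import pandas as pd

# Same df1 as in the question, except that C is missing in the first row
# (assumption, for illustration only)
df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': [np.nan, 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    'CC': ['121', '345'],
                    'DD': ['D0', 'D1']})

merged = pd.merge(df1, df2, how='left',
                  left_on=['A', 'B'], right_on=['A', 'CC'])

# Bare dropna() also removes the matched row, because its C value is NaN
plain = merged.dropna()
# Checking only the right-hand key keeps every row that found a partner
by_key = merged.dropna(subset=['CC'])
print(len(plain), len(by_key))
```

The `subset=['CC']` variant assumes that CC itself has no missing values in df2; otherwise an inner join is the safer choice.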
zipa
  • It merges on the right columns, but the problem is the same: the columns from the right dataframe (df2) in Both_DFs are just empty or NaN. The rows from df1 got merged into Both_DFs, same as with my script above. The columns from df2 are there, but their rows are just empty – ARJ May 02 '17 at 10:25
  • Made an edit, seems to work :) – zipa May 02 '17 at 10:44
  • Yes, it worked :) Thank you – ARJ May 02 '17 at 10:56
  • @zipa - I think `left_on=['B'],right_on=['CC']` can also be removed, because there is no match in `B` and `CC`. Can you also add your output? – jezrael May 02 '17 at 11:29
  • @jezrael It can be removed in this case, but maybe OP has some data where it shouldn't be removed :) – zipa May 02 '17 at 11:41
  • @zipa - Hmm, I thought the data are first joined by the indexes and then by the `on` parameters, but it seems there is only a match by `index`. So `on` can be removed. But if I am wrong, let me know. Thanks. – jezrael May 02 '17 at 11:43
  • @jezrael From my perspective, the only thing that bugs me is the 3rd row, which doesn't look like it should be there. You explained why it is there, but I think it just shouldn't be. – zipa May 02 '17 at 11:46
  • @zipa, but there is a problem: it is printing the duplicated rows as well. Also, the merge is not based on the exactly matching rows from both dataframes. Could you please revise your code once more – ARJ May 02 '17 at 11:46
  • Sure, I'm on it :) – zipa May 02 '17 at 11:47
  • @user1017373 - can you add the desired output to the question? – jezrael May 02 '17 at 11:53
  • @jezreal, Sure added :) – ARJ May 02 '17 at 12:01
  • @user1017373 Just `dropna()`, your join was fine. – zipa May 02 '17 at 12:04
  • No, as you can see there are 3 rows, but there should be only one in the real case, since only one row matches exactly as per the condition. I have edited my question for more clarity. Sorry for being unclear – ARJ May 02 '17 at 12:08
  • Please look at the edit – zipa May 02 '17 at 12:11
4

You can also use join (a left join by default) or merge, and then, if necessary, remove the rows containing NaNs with dropna:

print (df1.join(df2.set_index('A'), on='A').dropna())
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
3  A3  146  K1  B3  345  D1

print (pd.merge(df1, df2, on='A', how='left').dropna())
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
3  A3  146  K1  B3  345  D1

EDIT:

I think you need an inner join (it is the default, so how='inner' can be omitted):

Both_DFs = pd.merge(df1,df2, left_on=['A','B'],right_on=['A','CC'])
print (Both_DFs)
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
jezrael
0

I don't know whether your example shows your exact problem, but:

When we merge on two keys, a row matches only if both keys match:

df1['A'] == df2['A'] and df1['B'] == df2['CC']

With the sample data in the question, only one row (A = 'A1', B = '121') satisfies both conditions.

If we merge on column A only, we get something like this:

Both_DFs=pd.merge(df1, df2, how='left', left_on=['A'], right_on=['A'])

    A    B   C   BB   CC   DD
0  A1  121  K0   B0  121   D0
1  A1  345  K1   B0  121   D0
2  A2  123  K0  NaN  NaN  NaN
3  A3  146  K1   B3  345   D1

If you want to remove the rows that are not in df2, change the 'how' argument to 'inner':

Both_DFs=pd.merge(df1, df2, how='inner', left_on=['A'], right_on=['A'])
   A    B   C   BB   CC   DD
0  A1  121  K0   B0  121   D0
1  A1  345  K1   B0  121   D0
2  A3  146  K1   B3  345   D1
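
To make the effect of `how` and of the key choice concrete, here is a small sketch that compares row counts on the question's sample data. The variable names are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    'CC': ['121', '345'],
                    'DD': ['D0', 'D1']})

# Left join on A: every row of df1 is kept, unmatched rows get NaN
left_on_a = pd.merge(df1, df2, how='left', on='A')
# Inner join on A: only rows whose A value exists in both frames
inner_on_a = pd.merge(df1, df2, how='inner', on='A')
# Inner join on both keys: A must match A and B must match CC
inner_on_both = pd.merge(df1, df2, how='inner',
                         left_on=['A', 'B'], right_on=['A', 'CC'])

print(len(left_on_a), len(inner_on_a), len(inner_on_both))  # 4 3 1
```

Only the two-key inner join narrows the result down to the single row in the desired output.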

Is this approach what you're looking for?

Jérémy Caré