Pandas: merging two dataframes

Question

The output of a MERGE operation on two pandas data frames does not yield the expected result:

**dfmatrix**:
    …   young   label   filename
0   …   1       neg     cv005_29357
1   …   0       neg     cv006_17022
2   …   0       neg     cv007_4992
3   …   1       neg     cv008_29326
4   …   1       neg     cv009_29417

**dfscores**:
   filename  score
0  cv005_29357   -10
1  cv006_17022   5

dfnew = pandas.merge(dfmatrix, dfscores, on='filename', how='outer', left_index=False, right_index=False)

**dfnew**:
   …    young   label   filename    score_y
0  …    0       neg     cv005_29357 NaN
1  …    1       neg     cv006_17022 NaN
2  …    0       neg     cv007_4992  NaN
3  …    0       neg     cv008_29326 NaN
4  …    1       neg     cv009_29417 NaN

Excpected Output:

**dfnew**:
   …    young   label   filename    score_y
0  …    0       neg     cv005_29357 -10
1  …    1       neg     cv006_17022 5
2  …    0       neg     cv007_4992  NaN
3  …    0       neg     cv008_29326 NaN
4  …    1       neg     cv009_29417 NaN

What am I doing wrong?

Update: this post suggests that MERGE is the way to go for the purposes of joining two data frames

show these as already read in frames; the indices are very important. — Jeff, Jul 24 '14 at 13:39
Your output does not match your code, how can you get a column of `score_y` if there are no clashes? You only have column `score` in dfscores and not in the other df. — EdChum, Jul 24 '14 at 13:59
@EdChum, dfmatrix is a bag-of-words matrix which happens to have a column for the word 'score' — , Jul 24 '14 at 14:08
The output that you posted is definitely not correct, you should have 6 rows. Reason it "isn't working" is cv005_29357 != cv006_29357 and cv006_17022 != cv009_17022. An outer merge should recognize these as unequal and not join these rows together — ZJS, Jul 24 '14 at 15:19
this is my mistake. when re-writing this post to add the indexes as suggested i messed it up a bit , i have corrected this now in the post. Sorry about that. — , Jul 24 '14 at 15:28
ok now try your code again. It works perfectly for me. I am also using pandas 0.14.1 — ZJS, Jul 24 '14 at 17:41

score 0 · Accepted Answer · edited May 23 '17 at 10:33

the problem was at the file level: the entries in the filename column of the dfscores file being read had a trailing whitespace which caused the JOIN to fail. Admitted, this is not a glorious moment for me but neverthelesss these things happen and i think it is worth posting the answer as it may happen to other less experienced coders.

To automate the process:

dfscores['filename'] = dfscores['filename'].map(lambda x: x.strip())

source: Pandas DataFrame: remove unwanted parts from strings in a column

Pandas: merging two dataframes

1 Answers1