0

The output of a MERGE operation on two pandas data frames does not yield the expected result:

**dfmatrix**:
    …   young   label   filename
0   …   1       neg     cv005_29357
1   …   0       neg     cv006_17022
2   …   0       neg     cv007_4992
3   …   1       neg     cv008_29326
4   …   1       neg     cv009_29417

**dfscores**:
   filename  score
0  cv005_29357   -10
1  cv006_17022   5

dfnew = pandas.merge(dfmatrix, dfscores, on='filename', how='outer', left_index=False, right_index=False)

**dfnew**:
   …    young   label   filename    score_y
0  …    0       neg     cv005_29357 NaN
1  …    1       neg     cv006_17022 NaN
2  …    0       neg     cv007_4992  NaN
3  …    0       neg     cv008_29326 NaN
4  …    1       neg     cv009_29417 NaN

Excpected Output:

**dfnew**:
   …    young   label   filename    score_y
0  …    0       neg     cv005_29357 -10
1  …    1       neg     cv006_17022 5
2  …    0       neg     cv007_4992  NaN
3  …    0       neg     cv008_29326 NaN
4  …    1       neg     cv009_29417 NaN

What am I doing wrong?

Update: this post suggests that MERGE is the way to go for the purposes of joining two data frames

Community
  • 1
  • 1
  • show these as already read in frames; the indices are very important. – Jeff Jul 24 '14 at 13:39
  • 1
    Works in pandas `0.14.1` what version are you running? – EdChum Jul 24 '14 at 13:39
  • @Jeff, I have added the indexes now –  Jul 24 '14 at 13:55
  • @EdChum, pip show pandas: 0.14.1 –  Jul 24 '14 at 13:57
  • 2
    Your output does not match your code, how can you get a column of `score_y` if there are no clashes? You only have column `score` in dfscores and not in the other df. – EdChum Jul 24 '14 at 13:59
  • @EdChum, dfmatrix is a bag-of-words matrix which happens to have a column for the word 'score' –  Jul 24 '14 at 14:08
  • The output that you posted is definitely not correct, you should have 6 rows. Reason it "isn't working" is cv005_29357 != cv006_29357 and cv006_17022 != cv009_17022. An outer merge should recognize these as unequal and not join these rows together – ZJS Jul 24 '14 at 15:19
  • this is my mistake. when re-writing this post to add the indexes as suggested i messed it up a bit , i have corrected this now in the post. Sorry about that. –  Jul 24 '14 at 15:28
  • ok now try your code again. It works perfectly for me. I am also using pandas 0.14.1 – ZJS Jul 24 '14 at 17:41

1 Answers1

0

the problem was at the file level: the entries in the filename column of the dfscores file being read had a trailing whitespace which caused the JOIN to fail. Admitted, this is not a glorious moment for me but neverthelesss these things happen and i think it is worth posting the answer as it may happen to other less experienced coders.

To automate the process:

dfscores['filename'] = dfscores['filename'].map(lambda x: x.strip())

source: Pandas DataFrame: remove unwanted parts from strings in a column

Community
  • 1
  • 1