46

I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which first prompted me that there might not be any entries in the intersection:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior.

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'

# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result

# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for them as strings don't yield any results. Searching for them as integers does return a result, and I think this is the reason why the merge doesn't work above..

Any ideas what's going on?

It's almost as thought Pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

user1496984
  • 10,957
  • 8
  • 37
  • 46
  • 1
    it looks like you have mixed dtypes in your columns, I suggest first trying to coerce all values to numeric so `df1['col1'] = df1['col1'].astype(int)` this may fail which means you have some str values which cannot be expressed as int, so next try `df1['col1'] = pd.to_numeric(df1['col1'], errors='coerce')` which will force the duff values to `NaN` where it can – EdChum Sep 19 '16 at 22:31
  • Ah, thanks @EdChum! Some of the values can be converted to `int`, but others can't. The proper dtype of that column should therefore be `str`, but since that's not a proper dtype, I thought `object` would work. But when I manually converted everything to `str`, the merge did work. Thanks! – user1496984 Sep 19 '16 at 22:59

7 Answers7

72

The issue was that the object dtype is misleading. I thought it mean that all items were strings. But apparently, while reading the file pandas was converting some elements to ints, and leaving the remainders as strings.

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.

(I wish there was a way of specifying a dtype of str...)

user1496984
  • 10,957
  • 8
  • 37
  • 46
  • 7
    Weird. Your solution worked. But both before and after, the dtype of the variable in question is 'O'. I guess, like you alluded to, there's more under the hood with these object types. Wish it was more transparent. – Ben Ogorek Dec 22 '17 at 15:58
  • 2
    God blless you good man ! You saved me hours of sceatching my head! – Irka Irenka Mar 21 '20 at 06:01
  • The same happens when you apply a vlookup in excel for such columns. I was getting erroneous results in a quick vlookup so resorted to pandas but was getting the same output (nan) there. – Asad Rauf Jul 28 '21 at 12:13
  • thank you! Once i converted the column i m merging on to str in both columns it worked - phew! – vanetoj Sep 23 '22 at 17:24
27

I ran into a case where the df.col = df.col.astype(str) solution did not work. Turns out the problem was in the encoding.

My original data looked like this:

In [72]: df1['col1'][:3]
Out[73]: 
             col1
0  dustin pedroia
1  kevin youkilis
2     david ortiz

In [72]: df2['col2'][:3]
Out[73]: 
             col2
0  dustin pedroia
1  kevin youkilis
2     david ortiz

And after using .astype(str) the merge still wasn't working so I executed the following:

df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')

and was able to find the difference:

In [95]: df1
Out[95]: 
                       col1
0  b'dustin\xc2\xa0pedroia'
1  b'kevin\xc2\xa0youkilis'
2     b'david\xc2\xa0ortiz'

In [95]: df2
Out[95]: 
                col2
0  b'dustin pedroia'
1  b'kevin youkilis'
2     b'david ortiz'

At which point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0',' ') on the decoded df1.col1 variable (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.

NOTE: Regardless of what I was replacing I always used .str.encode('utf-8') to check whether it worked.

Alternatively

Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda I found the following difference.

import re
#places the raw string into a list
df1.col1 = df1.col1.apply(lambda x: re.findall(x, x))  
df2.col2 = df2.col2.apply(lambda x: re.findall(x, x))

where my df1 data turned into this (copied and pasted from Spyder):

['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']

which just has a slightly different solution. I don't know in what case the first example wouldn't work and the second would but I wanted to provide both just in case someone runs into it :)

seeiespi
  • 3,628
  • 2
  • 35
  • 37
  • Very helpful! Had the same problem and it turned out to be the encoding, which I did not expect! Thanks! – Diego Duarte Feb 08 '21 at 12:28
  • Thanks, it was the \xa0 issue because of web scraping with beautifulsoup for my case. Replacing it solved the issue – EFKan Dec 08 '21 at 22:33
14

Thanks,@seeiespi the ..str.encode('utf-8') has helped me to figure out that my string needs to be stripped, as below

20                 b'Belize '   ...     0,612
21                  b'Benin '   ...     0,546

The solution was to use the strip

df1.col1 = df1.col1.str.strip()
df1.col1 = df1.col1.str.strip()
0xFK
  • 2,433
  • 32
  • 24
  • This .strip() saved my day!! And actually it's only visible after encoding, so I recommend that. Thank u so much! – ranemak Jun 05 '22 at 14:29
4

None of the above solutions worked for me because the merge was actually done correctly but the indexing got messed up. Removing the index solved it for me:

df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
MartenCatcher
  • 2,713
  • 8
  • 26
  • 39
Jan
  • 41
  • 1
  • wow, this was unexpected. Thank you, this answer helped me a ton; merging two string columns, one of them unique. pd.merge has its quirks. – Krisselack Mar 19 '20 at 09:09
3

This answer solved it for me:

pd.merge(df1.assign(x=df1.x.astype(str)), 
         df2.assign(x=df2.x.astype(str)), 
         how='left', on='x')

from Pandas merge issue on key of object type containing number and string values

Kyle Pennell
  • 5,747
  • 4
  • 52
  • 75
2

Probably your columns have some difference or a blank space that is resulting on this error.

First check your column type and if there is any different between items

df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')

If there is any difference on them, you can use

df1.col1 = df1.col1.str.replace("this", "for that")

or if there is any blank space

df1.col1 = df1.col1.apply(str).str.strip()
# This apply(str) is being used because without it, the program returns an error related to being enable to convert from byte.
0

This step:

df1.col1 = df1.col1.str.strip()
df1.col1 = df1.col1.str.strip()

and then doing this step:

pd.merge(df1.assign(x=df1.x.astype(str)), 
         df2.assign(x=df2.x.astype(str)), 
         how='left', on='x')

worked for me. I meant BOTH together.