19

None of these questions adress the issue: Question 1 and Question 2 nor could I find the answer in pandas documentation.

Hello, I am trying to find the underlying cause for this error:

ValueError: You are trying to merge on object and int64 columns.

I know I can work around this problem using pandas concat or merge function, but I am trying to understand the cause for the error. The question is: Why do I get this ValueError?

Here's the output of the head(5) and info() on both dataframes that are used.

print(the_big_df.head(5)) Output:

  account  apt  apt_p  balance       date  day    flag  month  reps     reqid  year
0  AA0420    0    0.0  -578.30 2019-03-01    1       1      3    10  82f2d761  2019
1  AA0420    0    0.1  -578.30 2019-03-02    2       1      3    10  82f2d761  2019
2  AA0420    0    0.1  -578.30 2019-03-03    3       1      3    10  82f2d761  2019
3  AA0421    0    0.1  -607.30 2019-03-04    4       1      3    10  82f2d761  2019
4  AA0421    0    0.1  -610.21 2019-03-05    5       1      3    10  82f2d761  2019

print(the_big_df.info()) Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36054 entries, 0 to 36053
Data columns (total 11 columns):
account        36054 non-null object
apt            36054 non-null int64
apt_p          36054 non-null float64
balance        36054 non-null float64
date           36054 non-null datetime64[ns]
day            36054 non-null int64
flag           36054 non-null int64
month          36054 non-null int64
reps           36054 non-null int32
reqid          36054 non-null object
year           36054 non-null int64
dtypes: datetime64[ns](1), float64(2), int32(1), int64(5), object(2)
memory usage: 3.2+ MB

Here's the dataframe I'm passing to the join(); print(df_to_join.head(5)):

      reqid     id
0  54580f39  13301
1  3ba905c0  77114
2  5f2d80da  13302
3  a1478e98  77115
4  9b09854b  78598

print(df_to_join.info()) Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14332 entries, 0 to 14331
Data columns (total 2 columns):
reqid    14332 non-null object
dni      14332 non-null object

The exact next line after the 4 prints stated above is:

the_max_df = the_big_df.join(df_to_join,on='reqid')

And the output is, as stated above:

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Why does this happen, when before it is clearly stated that column reqid is an object in both dataframes? Thanks.

Celius Stingher
  • 17,835
  • 6
  • 23
  • 53

3 Answers3

56

The problem here is a misconception of how join works: when you say the_big_df.join(df_to_join,on='reqid') it does not mean join on the_big_df.reqid == df_to_join.reqid as one would assume at first glance, but rather join on the_big_df.reqid == df_to_join.index. As requid is of type object and the index is of type int64 you get the error.

See the docs for join:

Join columns with other DataFrame either on index or on a key column.
...
on : str, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

Look at the following example:

df1 = pd.DataFrame({'id1': [1, 2], 'val1': [11,12]})
df2 = pd.DataFrame({'id2': [3, 4], 'val2': [21,22]})
print(df1)
#   id1  val1
#0    1    11
#1    2    12
print(df2)
#   id2  val2
#0    3    21
#1    4    22

# join on df1.id1 (int64) == df2.index (int64) 
print(df1.join(df2, on='id1'))
#   id1  val1  id2  val2
#0    1    11  4.0  22.0
#1    2    12  NaN   NaN

# now df3 same as df1 but id3 as object:
df3 = pd.DataFrame({'id3': ['1', '2'], 'val1': [11,12]})

# try to join on df3.id3 (object) == df2.index (int64) 
df3.join(df2, on='id3')
#ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat


Please note: the above is for a modern versions of pandas. Version 20.3 gave the following result:
>>> df3.join(df2, on='id3')
  id3  val1  id2  val2
0   1    11  NaN   NaN
1   2    12  NaN   NaN
Stef
  • 28,728
  • 2
  • 24
  • 52
5

Use pandas.DataFrame.merge.

See the docs for merge

Aadil Srivastava
  • 609
  • 8
  • 12
  • 3
    Thanks for your answer. I wrote in the question that I know I could solve it with merge, but wanted to understand the problem I was facing using `.join()` – Celius Stingher May 05 '21 at 13:32
0

Why don't you astype(str) to both reqid columns and see if it's still an issue.

manwithfewneeds
  • 1,137
  • 1
  • 7
  • 10
  • I will try that tomorrow! They *should* strings given their format. The info method shows they are, but the join says different! – Celius Stingher Sep 04 '19 at 22:13