
When I use pandas.DataFrame.replace(dict) to convert user_id strings to integers, I receive:

"OverflowError: Python int too large to convert to C long".

sample code:

import pandas as pd
x = {'user_id':['100000715097692381911', 
                '100003840837471130074'], 
     'item_id': [1, 2]
     }
dfx = pd.DataFrame(x)
dfx['user_id'].replace(
    {
     '100000715097692381911': 0, 
     '100003840837471130074': 1
     }, inplace=True)

I don't understand why this is marked as a duplicate. I think the problem is that pandas treats the str values as integers. I didn't load those big ID numbers as integers but as strings. Indeed, if I prepend a character to the 'user_id' string, like 's100000715097692381911', no OverflowError is raised.

Weihao Wang
  • @Aeossa I searched before asking, and it seems the link you posted can't solve my problem. I think maybe pandas somehow takes the user_id string as a large integer? But dfx.dtypes shows its type is object – Weihao Wang Mar 08 '19 at 14:32

1 Answer


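In C, the size of a long depends on the platform. On Windows and on 32-bit systems it is 4 bytes and can only store values between -2,147,483,648 and 2,147,483,647. On 64-bit Linux it is 8 bytes, with a maximum of 9,223,372,036,854,775,807. Your 21-digit IDs are too large for either.

To answer your other question, a string in C is stored as a char array, so its memory space is 1 byte for each char, plus 1 byte for the terminating null character. This means a Python string passed to C won't cause an overflow, but a large integer will.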

Source: https://www.tutorialspoint.com/cprogramming/c_data_types.htm
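As a quick check of the size argument, the sketch below compares the first ID from the question with the largest signed 32-bit and 64-bit values, and reports the size of a C long on the current machine via ctypes. The names `INT32_MAX`, `INT64_MAX` and `long_bytes` are just illustrative, not anything pandas defines.

```python
import ctypes

# First user_id from the question, parsed as a Python int (arbitrary precision).
user_id = int('100000715097692381911')

# Largest values a signed 32-bit and a signed 64-bit integer can hold.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# Size of a C long on this platform (typically 4 or 8 bytes).
long_bytes = ctypes.sizeof(ctypes.c_long)

print("C long size in bytes:", long_bytes)
print("exceeds 32-bit long:", user_id > INT32_MAX)
print("exceeds 64-bit long:", user_id > INT64_MAX)
```

If both comparisons print True, the ID cannot fit in a C long on either kind of platform, which is consistent with the OverflowError appearing only when something tries to treat the string as a number.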

Luke Ning
  • Thanks, but I didn't load those big numbers as integers. You can see that they are string type in my code. – Weihao Wang Mar 08 '19 at 14:44
  • @WeihaoWang Loading the big numbers as ints might help you. Try letting the big numbers be ints, and change your df definition to `dfx = pd.DataFrame(x, dtype=str)` and then changing your `replace` to look for int keys instead – Luke Ning Mar 08 '19 at 14:51
  • I tried your suggestion, it doesn't work. Thank you anyway. Could it be a bug of pandas? – Weihao Wang Mar 08 '19 at 14:58
  • @WeihaoWang really? it worked for me. Are you still getting an overflow error? – Luke Ning Mar 08 '19 at 15:02
  • uh.. it didn't report overflow, but the elements are not replaced with correct values. – Weihao Wang Mar 08 '19 at 15:09
  • @WeihaoWang did you change your dictionary in your `replace` to ints instead of string keys? – Luke Ning Mar 08 '19 at 15:22
  • Yes, I did. I think I did what you said exactly. Could you post your code in case I misunderstood. – Weihao Wang Mar 08 '19 at 15:55
  • ```import pandas as pd x = {'user_id': [100000715097692381911, 100003840837471130074], 'item_id': [1, 2] } dfx = pd.DataFrame(x, dtype=str) dfx['user_id'].replace( { 100000715097692381911: 0, 100003840837471130074: 1 }, inplace=True) ``` You're gonna have to fix the formatting – Luke Ning Mar 08 '19 at 16:02
  • Well, it is a pandas BUG, and I'll report it. I tried my example code with another computer, where the environment setup is pandas 0.20.3, python 3.6.3, Win 10, it works okay. The overflow error I reported was on my laptop with pandas 0.24.0, python 3.7, Ubuntu 18.04. – Weihao Wang Mar 08 '19 at 16:07
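If the goal is only to turn each ID string into a small integer, one possible workaround is to avoid `replace` and use `Series.map` with the same dictionary. This is a minimal sketch, assuming every ID in the column has an entry in the dictionary (unmapped values would come back as NaN); the variable name `id_map` is mine. I have not verified it against pandas 0.24.0 specifically.

```python
import pandas as pd

x = {'user_id': ['100000715097692381911',
                 '100003840837471130074'],
     'item_id': [1, 2]}
dfx = pd.DataFrame(x)

# Same string-to-integer mapping as in the question.
id_map = {'100000715097692381911': 0,
          '100003840837471130074': 1}

# map() looks each string up in the dictionary and returns a new Series,
# which is assigned back to the column instead of using inplace=True.
dfx['user_id'] = dfx['user_id'].map(id_map)

print(dfx)
```

Assigning the result back keeps the IDs as strings until the lookup happens, so nothing needs to parse them as numbers.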