
I am working with a database that is very poorly organized. There are CustomerIds that are somehow bigger than int64. Here is an example: 88168142359034442077.0

In order to be able to use this ID, I need to turn it into a string and remove the decimal. I have tried to use the following code:

testdf = pd.DataFrame({'CUSTID': ['99418675896216.02342351', '88168142359034442077.0213', '53056496953']})
testdf['CUSTID'] = testdf['CUSTID'].astype('float64').astype('int64').astype(str)
testdf.display()

When I use the above method I get an overflow, and the numbers that are bigger than int64 become negative, like -9223372036854775808 for 88168142359034442077.0213.

I have been looking for other ways to make the change from string to float, then float to int, and finally int back to string.

One method I tried is to just not use astype('int64'), but then the output ends up in scientific notation, like 8.816814235903445e+19 for 88168142359034442077.0213, and other than using regex to remove the decimal and the 'e+19' I don't really see what else I can do.

Any information is appreciated. Thanks!

Trodenn
    What's the proper customer ID for `88168142359034442077.0213`? Drop everything after the decimal, or just omit the decimal and end with `0770213`? – Adam Smith Nov 28 '22 at 21:43
  • Try `.astype(object)` after `.astype(int64)`. Note that the performance will be pretty bad since large CPython objects are pretty slow compared to native integers (it uses callbacks calling the CPython interpreter code instead of pure-C vectorized Numpy functions). Do not use floats, they are not precise enough for this (it may work for some value but not for all of them). The negative values are due to overflows. – Jérôme Richard Nov 28 '22 at 21:45
  • As far as I was told, I can drop everything after the decimal. So in the case of 88168142359034442077.0213 it would become 88168142359034442077. – Trodenn Nov 28 '22 at 21:46
  • @JérômeRichard So you mean I should write something like testdf['CUSTID'] = testdf['CUSTID'].astype('int64').astype(object).astype(str)? Suppose you had to deal with customer IDs that are written this way, how would you go about making something more optimized? – Trodenn Nov 28 '22 at 21:49
  • Ha, I missed that the `astype('int64')` will cause the overflow anyway. You need to use `astype(object)` from a string column but AFAIK Pandas should already store string-based columns into an object-based array. Thus, you can convert objects manually. – Jérôme Richard Nov 28 '22 at 21:56
  • @JérômeRichard OK, i see what you mean. However I still don't really know what would be the best way to remove the decimal with the dtype being object. Do I just use strip() or regex to do so? – Trodenn Nov 28 '22 at 21:57
  • `testdf` with strings will be an `object` dtype Series. That column could also hold Python ints, which can be arbitrarily large. You have to decide for yourself whether the decimal part has any significance; if you want to drop it, I'd try to do so at the string level, and then go direct to int, skipping the float step. – hpaulj Nov 29 '22 at 00:23

3 Answers


Posting as an answer because this became too large for a comment and I believe it has further value.

I'd be very surprised if those values are the real, expected IDs and not an erroneous artifact of importing some text or binary format.

Specifically, the authoring program(s) and the database itself are almost certainly not using some high-memory decimal representation for a customer identifier, and would instead use "normal" types such as int64, if the IDs are represented numerically at all!

Further, floating-point values expose programs to IEEE 754 floating-point aliasing woes (see Is floating point math broken?), which will subtly foil all sorts of lookups and comparisons and generally just can't represent these values pleasantly or consistently, so it's unlikely that anyone would reasonably use them.
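A quick interpreter check illustrates the aliasing (adjacent doubles at this magnitude are 16384 apart, so the original ID cannot survive a round-trip through float):

>>> x = float("88168142359034442077.0213")
>>> x
8.816814235903445e+19
>>> x + 1 == x    # adding 1 is lost entirely at this magnitude
True
>>> int(x) == 88168142359034442077    # the original ID is not recoverable
False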

A contrived example

>>> data1 = "111001111001110100110001111000110110110111110101111000111001110110110010110001110110101110110000110010110011110100110010110011110101110001"
>>> data2 = "111000111000110001110110111000110001110100110010110011110101111001110000110011110100110100110100110010110000110111110111101110110000110010110001110011"
>>> for data in (data1, data2):
...     print("".join(chr(eval("0b" + data[block:block+6])) for block in range(0, len(data), 6)))
... 
99418675896216.02342351
88168142359034442077.0213

It's a long shot, but a fair suspicion is that this can happen when

  • a user(s) is entering a new entry, but doesn't have a customer ID (yet?)
  • a UI is coded to only accept numeric strings
  • there is no other checking and the database stores the value as a string
  • upon discovering this, user(s) regularly jumble essentially meaningless, but check-passing characters into the field to progress their work

You could attempt a further comparison of these to see, for example, whether

  • they are all from a specific user
  • they are all from a specific date
  • the string representation becomes longer or shorter as time progresses (as the user becomes lazier or less sure they have used a value)
ti7
  • I agree with the premises, and this is probably the best piece of advice if one wants to get to the bottom of it. On the other hand, they are probably just IDs and it is more likely they do not bring any value *per se*, as long as you can convert them to something workable consistently and reliably. Keeping them as strings may be sub-optimal but it will work. Getting to the original value blindfolded might prove impossible. Converting them to `int64` the naïve way (str->float->int) may be good enough, but it is not guaranteed to work. – norok2 Nov 28 '22 at 22:53

testdf['CUSTID'] is a pandas.Series object containing Python string objects. For a pandas.Series object to contain large integers, the most straightforward type to use is the Python int object (as opposed to native NumPy types, which are more efficient but fixed-size). You can convert to a Decimal type so as to get rid of the non-integer part. The conversion can be done using map:

import decimal

testdf['CUSTID'] = list(map(int, map(decimal.Decimal, testdf['CUSTID'].to_list())))

This is not very efficient, but both Unicode string objects and large variable-sized integer objects are inherently inefficient. Since NumPy does not support large integers natively, this is certainly the best option (though one may find a faster way to get rid of the non-integer part than using the decimal package).

Here is a string-based parsing method that is certainly slower, but supports very large integers without requiring a large fixed-size decimal precision:

testdf['CUSTID'] = [int(s.split('.')[0]) for s in testdf['CUSTID'].to_list()]
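Either way, a quick sanity check (a sketch; run after either conversion above) confirms the column now holds arbitrary-precision Python ints, so nothing overflows:

print(testdf['CUSTID'][1])         # 88168142359034442077, no overflow
print(type(testdf['CUSTID'][1]))   # <class 'int'>
print(testdf['CUSTID'].dtype)      # object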
Jérôme Richard
  • Thank you very much for the input. However, I'm afraid that when I use this, I still have the issue with the int: the big number that exceeds int64's max simply becomes a 'null' value. How can I remove the '.0' from the big number? – Trodenn Nov 28 '22 at 22:11
  • If the number is very very big, then the decimal package may not be able to fit it. You can tweak that by changing the context precision (see [here](https://docs.python.org/3/library/decimal.html)). The alternative solution is to use a string-based method to skip the non-integer part. I added such an implementation in the answer. – Jérôme Richard Nov 28 '22 at 22:15

I would recommend just leaving them as strings and trimming everything after the .:

import pandas as pd


testdf = pd.DataFrame({'CUSTID': ['99418675896216.02342351', '88168142359034442077.0213', '53056496953']})
testdf['CUSTID'] = testdf['CUSTID'].apply(lambda s: s.split(".", 1)[0])
testdf.display()

Note that you could replace lambda s: s.split(".", 1)[0] with something different, e.g. lambda s: s.partition(".")[0] or lambda s: re.match(r"^(\d+)(?:\.(\d+))?$", s).group(1) (the latter requires import re), but I would not expect any variation to be much faster than that. Just test them on some sample input to see which one works best for you. (Beware of the tempting lambda s: s[:s.find(".")]: for values with no ., find() returns -1 and the slice silently drops the last character.)
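For instance, a minimal timeit sketch for such a comparison could look like this (the candidates and the repetition factor are illustrative, not a definitive benchmark):

import re
import timeit

import pandas as pd

testdf = pd.DataFrame({'CUSTID': ['99418675896216.02342351', '88168142359034442077.0213', '53056496953'] * 1000})

candidates = {
    'split': lambda s: s.split(".", 1)[0],
    'partition': lambda s: s.partition(".")[0],
    'regex': lambda s: re.match(r"^(\d+)(?:\.(\d+))?$", s).group(1),
}
for name, func in candidates.items():
    # time 100 full passes over the column with each candidate
    elapsed = timeit.timeit(lambda: testdf['CUSTID'].apply(func), number=100)
    print(f"{name}: {elapsed:.3f} s")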


Alternatively, you may want to use the str accessor for pandas Series with extract():

testdf['CUSTID'] = testdf['CUSTID'].str.extract(r"^(\d+)(?:\.\d+)?$", expand=False)

(the fractional part must sit in a non-capturing group, and expand=False is needed so that extract() returns a single Series rather than a DataFrame), but I am unsure this would be any faster than the aforementioned solutions.

Perhaps you can achieve something faster with rstrip(), but your code would not be as simple as the above, since you would need to handle values without the . differently (a no-op) from the rest; a sketch of that special-casing follows.
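For completeness, a minimal sketch of that rstrip()-based special-casing might look as follows (untested for speed):

# rstrip() removes trailing characters drawn from a set, so it needs a guard:
# values without a '.' would otherwise be stripped down to an empty string.
testdf['CUSTID'] = testdf['CUSTID'].apply(
    lambda s: s.rstrip("0123456789").rstrip(".") if "." in s else s
)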

norok2