
I'm trying to insert long integers into a Pandas DataFrame:

import numpy as np
from pandas import DataFrame

# The last two uids exceed 2**63 - 1, hence the Python 2 long literals (L suffix)
data_scores = [(6311132704823138710, 273), (2685045978526272070, 23),
               (8921811264899370420, 45), (17019687244989530680L, 270),
               (9930107427299601010L, 273)]
dtype = [('uid', 'u8'), ('score', 'u8')]  # 'u8' = unsigned 64-bit integer
data = np.zeros((len(data_scores),), dtype=dtype)
data[:] = data_scores
df_crawls = DataFrame(data)
print df_crawls.head()

But when I look at the DataFrame, the last values, which are longs, are now negative:

                       uid  score
0  6311132704823138710    273
1  2685045978526272070     23
2  8921811264899370420     45
3 -1427056828720020936    270
4 -8516636646409950606    273

uids are 64-bit unsigned ints, so 'u8' should be the correct dtype, right? Any ideas?
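
For reference, the values do fit in an unsigned 64-bit integer; a quick check with numpy's iinfo:

import numpy as np
# uint64 holds values up to 2**64 - 1
print np.iinfo(np.uint64).max                           # 18446744073709551615
print 17019687244989530680L <= np.iinfo(np.uint64).max  # True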

Tom
  • Seems to be overflow. How about trying a "bigger" data type? – goFrendiAsgard Nov 25 '12 at 12:31
  • With u16: TypeError: data type not understood – Tom Nov 25 '12 at 13:45
  • Your np data looks fine, and the error suggests that pandas misses the `u` and gives you a signed long instead of unsigned. – deinonychusaur Nov 25 '12 at 14:12
  • My best guess is that numpy reserves the number of bits needed for each element in the array, while pandas might be using C, in which case the size of, e.g., a long depends on the architecture of your system (32 vs 64 bit). So in short, the problem might be that you are running your code on a 32-bit computer. – deinonychusaur Nov 25 '12 at 14:36

2 Answers


Yes, it's a present limitation of pandas; we do plan to add support for unsigned integer dtypes in the future. An error message would be much better:

http://github.com/pydata/pandas/issues/2355

For now you can make the column dtype=object as a workaround.
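
For example, a minimal sketch of that workaround, reusing the uids from the question (with dtype=object the Python longs are stored as-is instead of being cast to a signed int64):

import numpy as np
from pandas import DataFrame

uids = [6311132704823138710, 2685045978526272070, 8921811264899370420,
        17019687244989530680L, 9930107427299601010L]
scores = [273, 23, 45, 270, 273]

# Building the uid column as dtype=object sidesteps the signed int64 cast
df_crawls = DataFrame({'uid': np.array(uids, dtype=object), 'score': scores})
print df_crawls.dtypes  # uid: object, score: int64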

EDIT 2012-11-27

pandas now detects the overflow, though the column will be dtype=object for now, until DataFrame has better support for unsigned data types:

In [3]: df_crawls
Out[3]: 
                    uid  score
0   6311132704823138710    273
1   2685045978526272070     23
2   8921811264899370420     45
3  17019687244989530680    270
4   9930107427299601010    273

In [4]: df_crawls.dtypes
Out[4]: 
uid      object
score     int64
Wes McKinney

This won't tell you what to do, except to try on a 64-bit computer or contact the pandas developers (or patch the problem yourself...). But at any rate, this seems to be your problem:

The problem is that DataFrame does not understand unsigned 64-bit ints, at least on a 32-bit machine.

I changed the values of your data_scores to be better able to track what was happening:

data_scores = [(2**31 + 1, 273), (2 ** 31 - 1, 23), (2 ** 32 + 1, 45), (2 ** 63 - 1, 270), (2 ** 63 + 1, 273)]

Then I tried:

In [92]: data.dtype
Out[92]: dtype([('uid', '<u8'), ('score', '<u8')])

In [93]: data
Out[93]: 
array([(2147483649L, 273L), (2147483647L, 23L), (4294967297L, 45L),
       (9223372036854775807L, 270L), (9223372036854775809L, 273L)], 
      dtype=[('uid', '<u8'), ('score', '<u8')])

In [94]: df = DataFrame(data, dtype='uint64')

In [95]: df.values
Out[95]: 
array([[2147483649,                  273],
       [2147483647,                   23],
       [4294967297,                   45],
       [9223372036854775807,                  270],
       [-9223372036854775807,                  273]], dtype=int64)

Notice how the dtype of the DataFrame's values doesn't match the one requested in In [94]. Also, as I wrote in the comment above, the numpy array works perfectly. Further, if you specify uint32 in In [94], the DataFrame values still come out as int64; however, that doesn't give you negative overflows, probably because uint32 fits inside the positive range of int64.
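
As a sanity check, the negative numbers are exactly the unsigned values wrapped around at 2**64, i.e. the two's-complement reinterpretation as a signed int64; a quick sketch:

# An unsigned value v >= 2**63 shows up as v - 2**64 once it is
# reinterpreted as a signed 64-bit integer.
print (2 ** 63 + 1) - 2 ** 64          # -9223372036854775807, as in Out [95]
print 17019687244989530680L - 2 ** 64  # -1427056828720020936, as in the question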

deinonychusaur
  • Personally, I would see this as a bug in pandas that should be reported. Pandas should at the very least throw a warning when doing such an unsafe cast from numpy, and an error when using a different type than explicitly asked for... – seberg Nov 25 '12 at 15:42
  • I agree that it would be nicer. It is also worth noticing that it in fact makes a new copy of your data, so if the array was large you would use twice the memory... – deinonychusaur Nov 25 '12 at 16:02