
NumPy integer arrays can't store missing values.

>>> import numpy as np
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> myArray = np.arange(10)
>>> myArray.dtype
dtype('int32')

>>> myArray[0] = None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

>>> myArray.astype(dtype='float')
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> myFloatArray = myArray.astype(dtype='float')
>>> myFloatArray[0] = None

>>> myFloatArray
array([ nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.])

Pandas warns about this in the docs, under Caveats and Gotchas, "Support for integer NA". Wes McKinney also reiterates the point in this Stack Overflow question.

I need to be able to store missing values in an int array. I'm INSERTing rows into my database, which I've set up to accept only ints of varying sizes.

My current workaround is to store the array with an object dtype, which can hold both ints and None as elements.

>>> myArray.astype(dtype='object')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
>>> myObjectArray = myArray.astype(dtype='object')
>>> myObjectArray[0] = None
>>> myObjectArray 
array([None, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)

This seems to be memory-intensive and slow for large datasets. I was wondering if anyone has a better solution while the NumPy development is underway.
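The object-dtype overhead is easy to see on a toy array (the array size here is invented for illustration, and the figures assume a 64-bit build, where each object slot is an 8-byte pointer on top of a separate Python int object per element):

```python
import sys
import numpy as np

a = np.arange(1_000_000, dtype='int32')
obj = a.astype(dtype='object')

print(a.nbytes)            # 4 bytes per element for int32
print(obj.nbytes)          # 8 bytes per pointer on a 64-bit build
# Each pointed-to Python int adds its own per-object overhead on top,
# which nbytes does not count:
print(sys.getsizeof(obj[0]))
```

So even before counting the boxed Python ints themselves, the object array doubles the storage of the int32 original.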

  • What about `numpy.ma.MaskedArray`? – MSeifert Feb 22 '17 at 20:06
  • Interesting. Do you know if it works with Pandas? – Nirvan Feb 22 '17 at 20:15
  • It might be a bit hacky, but can't you assign a designated integer to fill those missing values? You could reserve such an integer number to mark only the missing values. Why do you want to insert only `None` in place of missing values? – kmario23 Feb 22 '17 at 20:17
  • What calculations are you doing with your data? How are the missing values supposed to be handled? Those issues are as important as storage density. Floats with `nan` and masked arrays have addressed some of those details. – hpaulj Feb 23 '17 at 00:47
  • Calculations aren't the problem -- the 'missing' values aren't missing because of random non-response or something. The null values usually indicate an individual is 'Not in the universe' for a particular variable, e.g. kids under 15 have a NULL for Labor Force Status; they aren't in the labor force, but they're not 'Not in labor force' either, so we indicate that with a NULL value. – Nirvan Feb 23 '17 at 15:32
  • And I think it's a matter of preference for our group to use NULL values instead of integer values -- it's less cognitive load when looking at the table. – Nirvan Feb 23 '17 at 15:33
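The masked-array suggestion from the comments can be sketched like this (a minimal example, not the asker's code): the data stay in an integer dtype, and a boolean mask marks the missing entries.

```python
import numpy as np

# Integer array plus a boolean mask; mask=False starts with nothing masked.
arr = np.ma.MaskedArray(np.arange(10), mask=False)
arr[0] = np.ma.masked          # mark position 0 as missing

print(arr.dtype)               # stays an integer dtype
print(arr)                     # masked entries print as --
print(arr.filled(-1))          # export with a sentinel value if needed
```

Whether this helps depends on the database step, since the mask still has to be translated into NULLs (or a sentinel) at INSERT time.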

1 Answer


I found a very quick way to convert all the missing values in my DataFrame into `None` types: the `.where` method.

mydata = mydata.where(pd.notnull(mydata), None)

It is much less memory-intensive than what I was doing before.
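A self-contained sketch of the idea (the frame and column name are invented for illustration). One caveat: newer pandas versions may coerce `None` back to `NaN` in a float column, so casting to `object` first makes sure the `None`s survive:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for `mydata`; NaN marks the missing ints.
mydata = pd.DataFrame({'labor_force_status': [1.0, np.nan, 3.0]})

# Cast to object first so None is kept as-is, then swap NaN for None
# wherever pd.notnull() is False.
mydata = mydata.astype(object).where(pd.notnull(mydata), None)
print(mydata['labor_force_status'].tolist())
```

The resulting `None` values pass straight through to the database driver as NULLs, which is exactly what the INSERT needs.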
