I have a Python module that loads data directly into a dict of numpy.ndarray for use in a pandas.DataFrame. However, I noticed an issue with 'NA' values. My file format represents NA values as -9223372036854775808 (boost::integer_traits::const_min). My non-NA values load as expected (with the right values) into the pandas.DataFrame. I believe what is happening is that my module loads into a numpy.datetime64 ndarray, which is then converted to a list of pandas.tslib.Timestamp. This conversion doesn't seem to preserve the 'const_min' integer. Try the following:
>>> pandas.tslib.Timestamp(-9223372036854775808)
NaT
>>> pandas.tslib.Timestamp(numpy.datetime64(-9223372036854775808))
<Timestamp: 1969-12-31 15:58:10.448384>
Is this a Pandas bug? I think I can have my module avoid using a numpy.ndarray in this case and use something pandas doesn't trip on (perhaps pre-allocating the list of tslib.Timestamp itself).
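For what it's worth, that const_min value is not arbitrary here: NumPy uses the same int64-min bit pattern internally as its NaT sentinel, which is presumably why the bare integer maps to NaT in the first example. A quick check with plain NumPy (no pandas involved):

```python
import numpy as np

# int64 min, the file format's NA marker
NA = np.iinfo(np.int64).min
assert NA == -9223372036854775808

# NumPy's datetime64 NaT uses the same bit pattern, so a raw
# ns-resolution view of the integer already reads back as NaT.
v = np.array([NA], dtype='i8').view('M8[ns]')
print(v[0])  # NaT
```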
Here is another example of unexpected things happening:
>>> npa = numpy.ndarray(1, dtype=numpy.datetime64)
>>> npa[0] = -9223372036854775808
>>> pandas.Series(npa)
0 NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>
Following Jeff's comment below, I have more information about what is going wrong.
>>> npa = numpy.ndarray(2, dtype=numpy.int64)
>>> npa[0] = -9223372036854775808
>>> npa[1] = 1326834000090451
>>> npa
array([-9223372036854775808, 1326834000090451])
>>> s_npa = pandas.Series(npa, dtype='M8[us]')
>>> s_npa
0 NaT
1 2012-01-17 21:00:00.090451
Yay! The series preserved the NA and my timestamp. However, if I attempt to create a DataFrame from that series, the NaT disappears.
>>> pandas.DataFrame({'ts':s_npa})
ts
0 1969-12-31 15:58:10.448384
1 2012-01-17 21:00:00.090451
Ho-hum. On a whim, I tried interpreting my integers as nanoseconds past the epoch instead. To my surprise, the DataFrame worked properly:
>>> s2_npa = pandas.Series(npa, dtype='M8[ns]')
>>> s2_npa
0 NaT
1 1970-01-16 08:33:54.000090451
>>> pandas.DataFrame({"ts":s2_npa})
ts
0 NaT
1 1970-01-16 08:33:54.000090451
Of course, my timestamp is not right. My point is that pandas.DataFrame is behaving inconsistently here. Why does it preserve the NaT when using dtype='M8[ns]', but not when using 'M8[us]'?
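One data point that may help locate the bug: NumPy itself special-cases NaT when converting between datetime64 units, so the sentinel survives a us-to-ns cast at the NumPy level. My assumption is that it is pandas' own us-to-ns conversion path that loses it; a small check of the NumPy side:

```python
import numpy as np

NA = np.iinfo(np.int64).min
a_us = np.array([NA, 1326834000090451], dtype='i8').view('M8[us]')

# NumPy preserves NaT across the unit conversion instead of
# rescaling the underlying sentinel integer by 1000.
a_ns = a_us.astype('M8[ns]')
print(a_ns[0])  # NaT
print(a_ns[1])  # 2012-01-17T21:00:00.090451000
```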
I am currently using this workaround to convert to nanoseconds, which slows things down quite a bit, but works:
>>> s = pandas.Series([1000*ts if ts != -9223372036854775808 else ts for ts in npa], dtype='M8[ns]')
>>> pandas.DataFrame({'ts':s})
ts
0 NaT
1 2012-01-17 21:00:00.090451
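If the list comprehension is the bottleneck, the same conversion can be done vectorized. This is just a sketch using numpy.where to multiply the non-NA microsecond values by 1000 while passing the sentinel through untouched:

```python
import numpy as np
import pandas as pd

NA = -9223372036854775808  # int64 min, the file format's NA marker
npa = np.array([NA, 1326834000090451], dtype=np.int64)

# Scale microseconds to nanoseconds, but leave the NA sentinel alone.
# (npa * 1000 wraps for the sentinel lane, but np.where discards it.)
ns = np.where(npa == NA, npa, npa * 1000)
s = pd.Series(ns.view('M8[ns]'))
print(s)
# 0                          NaT
# 1   2012-01-17 21:00:00.090451
```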
(Several hours later...)
Okay, I have progress. I've delved into the code and realized that the repr function on Series eventually calls '_format_datetime64', which checks 'isnull' and will print out 'NaT'. That explains the difference between these two:
>>> pandas.Series(npa)
0 NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>
The former seems to honor the NA, but only when printing. I suppose there may be other pandas functions that call 'isnull' and act on the answer, which might make NA timestamps seem to partially work in this case. However, I know the Series is incorrect because of the type of element zero: it is a Timestamp, but it should be a NaTType. My next step is to dive into the Series constructor to figure out when/how pandas uses the NaT value during construction. Presumably it is missing a case when I specify dtype='M8[us]'... (more to come).
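To check the inconsistency programmatically rather than by eyeballing the repr, one can compare what 'isnull' reports against the type that element access returns; on a correct build the two agree. A minimal sketch (the Series construction below mirrors the earlier examples):

```python
import numpy as np
import pandas as pd

NA = np.iinfo(np.int64).min
s = pd.Series(np.array([NA, 1326834000090451000], dtype='i8').view('M8[ns]'))

# The formatter effectively asks isnull(); element access goes
# through a different path, so compare the two directly.
print(pd.isnull(s).tolist())   # [True, False]
print(type(s[0]).__name__)     # should be NaTType, not Timestamp
```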
Following Andy's suggestion in the comments, I tried using a pandas Timestamp to resolve the issue. It didn't work. Here is an example of those results:
>>> npa = numpy.ndarray(1, dtype='i8')
>>> npa[0] = -9223372036854775808
>>> npa
array([-9223372036854775808])
>>> pandas.tslib.Timestamp(npa.view('M8[ns]')[0]).value
-9223372036854775808
>>> pandas.tslib.Timestamp(npa.view('M8[us]')[0]).value
-28909551616000
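Note that NumPy itself reads the sentinel as NaT at either resolution, so whatever mangles the 'M8[us]' value appears to happen inside the Timestamp conversion, not in the view. A quick check of that assumption:

```python
import numpy as np

npa = np.array([-9223372036854775808], dtype='i8')

# NumPy reports NaT for the sentinel at both resolutions; only the
# pandas Timestamp conversion of the us view produces a mangled value.
print(npa.view('M8[ns]')[0])  # NaT
print(npa.view('M8[us]')[0])  # NaT
```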