
I have a Python module that loads data directly into a dict of numpy.ndarray for use in a pandas.DataFrame. However, I noticed an issue with 'NA' values. My file format represents NA values as -9223372036854775808 (boost::integer_traits::const_min). My non-NA values are loading as expected (with the right values) into the pandas.DataFrame. I believe what is happening is that my module loads into a numpy.datetime64 ndarray, which is then converted to a list of pandas.tslib.Timestamp. This conversion doesn't seem to preserve the 'const_min' integer. Try the following:

>>> pandas.tslib.Timestamp(-9223372036854775808)
NaT
>>> pandas.tslib.Timestamp(numpy.datetime64(-9223372036854775808))
<Timestamp: 1969-12-31 15:58:10.448384>

Is this a Pandas bug? I think I can have my module avoid using a numpy.ndarray in this case and use something pandas doesn't trip on (perhaps pre-allocating the list of tslib.Timestamp itself).
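For what it's worth, that sentinel is not arbitrary: -9223372036854775808 is int64's minimum, which numpy itself reserves as the NaT marker for datetime64. A quick check with a current numpy/pandas (pd.isna is the modern spelling of isnull):

```python
import numpy as np
import pandas as pd

INT64_MIN = np.iinfo(np.int64).min   # same value as boost's const_min for int64
print(INT64_MIN)                     # -9223372036854775808

# numpy reserves exactly this value as the NaT sentinel for datetime64:
print(np.datetime64(INT64_MIN, 'ns'))            # NaT
print(pd.isna(np.datetime64(INT64_MIN, 'ns')))   # True
```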

Here is another example of unexpected things happening:

>>> npa = numpy.ndarray(1, dtype=numpy.datetime64)
>>> npa[0] = -9223372036854775808
>>> pandas.Series(npa)
0   NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>

Following Jeff's comment below, I have more information about what is going wrong.

>>> npa = numpy.ndarray(2, dtype=numpy.int64)
>>> npa[0] = -9223372036854775808
>>> npa[1] = 1326834000090451
>>> npa
array([-9223372036854775808,     1326834000090451])
>>> s_npa = pandas.Series(npa, dtype='M8[us]')
>>> s_npa
0                          NaT
1   2012-01-17 21:00:00.090451

Yay! The series preserved the NA and my timestamp. However, if I attempt to create a DataFrame from that series, the NaT disappears.

>>> pandas.DataFrame({'ts':s_npa})
                      ts
0 1969-12-31 15:58:10.448384
1 2012-01-17 21:00:00.090451

Ho-hum. On a whim, I tried interpreting my integers as nanoseconds past the epoch instead. To my surprise, the DataFrame worked properly:

>>> s2_npa = pandas.Series(npa, dtype='M8[ns]')
>>> s2_npa
0                             NaT
1   1970-01-16 08:33:54.000090451
>>> pandas.DataFrame({"ts":s2_npa})
                             ts
0                           NaT
1 1970-01-16 08:33:54.000090451

Of course, my timestamp is not right. My point is that pandas.DataFrame is behaving inconsistently here. Why does it preserve the NaT when using dtype='M8[ns]', but not when using 'M8[us]'?
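As a sanity check of the nanosecond path, the NaT does survive the Series-to-DataFrame round trip when the raw int64 bits are reinterpreted as 'M8[ns]'. A minimal reproduction (this sketch uses a `.view` instead of the `dtype=` argument, and the modern `pd` namespace, but shows the same round trip):

```python
import numpy as np
import pandas as pd

npa = np.array([np.iinfo(np.int64).min, 1326834000090451], dtype=np.int64)

# Reinterpret the raw int64 bits as nanoseconds since the epoch.
s = pd.Series(npa.view('M8[ns]'))
df = pd.DataFrame({'ts': s})
print(df['ts'].isnull().tolist())   # [True, False] -- the NaT is preserved
```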

I am currently using this workaround to convert the array, which slows things down quite a bit, but works:

>>> s = pandas.Series([1000*ts if ts != -9223372036854775808 else ts for ts in npa], dtype='M8[ns]')
>>> pandas.DataFrame({'ts':s})
                          ts
0                        NaT
1 2012-01-17 21:00:00.090451
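If the Python-level loop is too slow before the load path can be changed, the same conversion can be done vectorized in numpy. This is a sketch of an alternative not in the original post: np.where evaluates both branches, so the sentinel row does overflow in the multiplied branch, but that result is discarded in favor of the untouched sentinel.

```python
import numpy as np
import pandas as pd

INT64_NA = np.iinfo(np.int64).min   # the -9223372036854775808 sentinel

npa = np.array([INT64_NA, 1326834000090451], dtype=np.int64)

# Scale microseconds to nanoseconds everywhere except the NA sentinel,
# then reinterpret the raw int64 bits as datetime64[ns].
ns = np.where(npa == INT64_NA, npa, npa * 1000)
s = pd.Series(ns.view('M8[ns]'))
print(s)
```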

(Several hours later...)

Okay, I have progress. I've delved into the code and realized that the repr function on Series eventually calls '_format_datetime64', which checks 'isnull' and will print out 'NaT'. That explains the difference between these two:

>>> pandas.Series(npa)
0   NaT
>>> pandas.Series(npa)[0]
<Timestamp: 1969-12-31 15:58:10.448384>

The former seems to honor the NA, but it only does so when printing. I suppose there may be other pandas functions that call 'isnull' and act based on the answer, which might seem to partially work for NA timestamps in this case. However, I know that the Series is incorrect due to the type of element zero. It is a Timestamp, but should be a NaTType. My next step is to dive into the constructor for Series to figure out when/how pandas uses the NaT value during construction. Presumably, it is missing a case when I specify dtype='M8[us]'... (more to come).
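That 'isnull' check keys directly off the sentinel: given the raw int64 bits viewed as datetime64[ns], pandas flags the const_min slot as missing. A minimal demonstration (using pd.isnull at the array level, the same predicate the formatter consults according to the code dive above):

```python
import numpy as np
import pandas as pd

raw = np.array([np.iinfo(np.int64).min, 0], dtype=np.int64)
dt = raw.view('M8[ns]')            # slot 0 is the NaT sentinel, slot 1 is the epoch
print(pd.isnull(dt))               # [ True False]
```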

Following Andy's suggestion in the comments, I tried using a pandas Timestamp to resolve the issue. It didn't work. Here is an example of those results:

>>> npa = numpy.ndarray(1, dtype='i8')
>>> npa[0] = -9223372036854775808
>>> npa
array([-9223372036854775808])
>>> pandas.tslib.Timestamp(npa.view('M8[ns]')[0]).value
-9223372036854775808
>>> pandas.tslib.Timestamp(npa.view('M8[us]')[0]).value
-28909551616000
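The two .value results above differ because Timestamp normalizes whatever it is given to nanoseconds since the epoch, so a 'M8[us]' input gets rescaled on the way in. A quick check of that normalization (using the modern pd.Timestamp spelling):

```python
import pandas as pd

# .value is always nanoseconds since the epoch, regardless of input form:
print(pd.Timestamp('1970-01-01 00:00:01').value)   # 1000000000
print(pd.Timestamp(1000000000).value)              # bare ints are taken as ns
```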
D. A.
  • I think your example isn't great; Pandas' Timestamp doesn't use the same constructor for integers as numpy's datetime64... however, you'd hope that converting between the two would be consistent (but it doesn't seem to be :s). Do you think you could provide some code to say what you're doing? – Andy Hayden Feb 04 '13 at 21:13
  • I've added another code snippet that gives more information. I am using C++ to load data from disk directly into a dict of numpy.ndarray, which I then use to create a pandas.DataFrame. This works great (very fast) for int64, float64 and even datetime64. The problem is with the treatment of NA values. An alternative approach I am going to try is pre-allocating an array of pandas.tslib.Timestamp, and loading into that directly. I'm not certain this is possible. – D. A. Feb 04 '13 at 21:22
  • [Welcome to hell](http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64/13753918#13753918)... :S – Andy Hayden Feb 04 '13 at 21:41
  • what version of numpy are you using? < 1.6.2 is problematic - http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64 – Jeff Feb 05 '13 at 01:03
  • you can also try Series(values, dtype='M8[ns]'), and read your values as an int64 ndarray; dispensing completely with np.datetime64. your integer NaT will work for missing values as well – Jeff Feb 05 '13 at 01:34
  • Jeff, Numpy 1.6.2 indeed. I'll read that post to see if I can understand the Numpy issue and work around it. Your suggestion for 'M8[ns]' may also work for me as well. – D. A. Feb 05 '13 at 14:40
  • and coming in 0.10.2 (or 0.11-dev), Series will 'try harder' to do these conversions – Jeff Feb 05 '13 at 21:42
  • Andy, I read your link (didn't realize it was a link at first). I'm not sure it resolves the problem. I added another example in the post to show my attempt at using Timestamp to resolve the issue. – D. A. Feb 05 '13 at 22:45

1 Answer


Answer: No

Technically speaking, that is. I posted the bug on GitHub and got a response here: https://github.com/pydata/pandas/issues/2800#issuecomment-13161074

"Units other than nanoseconds are not supported right now in indexing etc. This should be strictly enforced"

All of the tests I've run with 'ns' rather than 'us' work fine. I'm looking forward to a future release.

For anyone interested, I modified my C++ python module to iterate over the int64_t arrays that I loaded from disk, and multiply everything by 1000, except for NA values (boost::integer_traits::const_min). I was worried about the performance, but the difference in load time is tiny for me. (Doing the same in Python is very, very slow.)

D. A.