
This question is motivated by an answer to a question on improving performance when performing comparisons with DatetimeIndex in pandas.

The solution converts the DatetimeIndex to a numpy array via df.index.values and compares the array to a np.datetime64 object. This appears to be the most efficient way to retrieve the Boolean array from this comparison.

The feedback on this question from one of the developers of pandas was: "These are not the same generally. Offering up a numpy solution is often a special case and not recommended."

My questions are:

  1. Are they interchangeable for a subset of operations? I appreciate DatetimeIndex offers more functionality, but I require only basic functionality such as slicing and indexing.
  2. Are there any documented differences in result for operations that are translatable to numpy?

In my research, I found some posts which mention "not always compatible", but none of them cite conclusive references or documentation, or specify why or when the two are generally incompatible. Many other posts use the numpy representation without comment.

jpp

1 Answer


In my opinion, you should always prefer using a Timestamp - it can easily be converted back into a numpy.datetime64 when that is needed.

numpy.datetime64 is essentially a thin wrapper for int64. It has almost no date/time specific functionality.

pd.Timestamp is a wrapper around a numpy.datetime64. It is backed by the same int64 value, but supports the entire datetime.datetime interface, along with useful pandas-specific functionality.
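To make the difference between the two scalars concrete, here is a minimal sketch. The variable names are only illustrative, and the date is borrowed from the timing example in this answer.

```python
import datetime

import numpy as np
import pandas as pd

ts = pd.Timestamp("2011-01-02 13:45")
dt64 = np.datetime64("2011-01-02T13:45")

# pd.Timestamp subclasses datetime.datetime, so the usual interface is available.
is_datetime = isinstance(ts, datetime.datetime)
year = ts.year
weekday = ts.day_name()

# np.datetime64 exposes no calendar fields such as .year.
has_year = hasattr(dt64, "year")

# Converting to numpy and back gives an equal Timestamp.
round_trip = pd.Timestamp(ts.to_datetime64())
```

The round trip is why preferring Timestamp costs little: the numpy scalar is one method call away whenever it is needed.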

The in-array representation of these two is identical - it is a contiguous array of int64s. pd.Timestamp is a scalar box that makes working with individual values easier.
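A small sketch of what "identical in-array representation" means in practice. The index below is made up for illustration. The sketch also shows that a datetime column stores datetime64 values but hands back a Timestamp when a single element is accessed.

```python
import numpy as np
import pandas as pd

# Hypothetical three-day index, used only for illustration.
idx = pd.date_range("2011-01-01", periods=3, freq="D")

raw = idx.values              # numpy array with a datetime64 dtype
as_int = raw.view("int64")    # reinterpret the same buffer as int64
same_ints = np.array_equal(as_int, idx.asi8)

boxed = idx[0]      # scalar access on the index boxes into pd.Timestamp
unboxed = raw[0]    # scalar access on the array gives np.datetime64

# A Series built from the index behaves the same way.
s = pd.Series(idx)
first = s.iloc[0]
```

So there is no separate "Timestamp array" to convert to: the storage is the same either way, and the boxing happens only when a scalar is pulled out.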

Going back to the linked answer, you could write it like this, which is shorter and happens to be faster.

%timeit (df.index.values >= pd.Timestamp('2011-01-02').to_datetime64()) & \
        (df.index.values < pd.Timestamp('2011-01-03').to_datetime64())
192 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
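As a rough check that the numpy route and the pandas route agree for this kind of comparison, the sketch below builds a small hypothetical frame (the original df is not shown in the question, so its shape and column name here are assumptions) and compares both masks. It covers only a timezone-naive index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the df in the question: five daily rows.
idx = pd.date_range("2011-01-01", periods=5, freq="D")
df = pd.DataFrame({"x": range(5)}, index=idx)

lo = pd.Timestamp("2011-01-02")
hi = pd.Timestamp("2011-01-03")

# Comparing the DatetimeIndex directly against Timestamps.
mask_index = (df.index >= lo) & (df.index < hi)

# Comparing the underlying numpy array against datetime64 scalars.
mask_values = (df.index.values >= lo.to_datetime64()) & \
              (df.index.values < hi.to_datetime64())

masks_agree = np.array_equal(mask_index, mask_values)
selected = df[mask_values]
```

Both masks are plain boolean numpy arrays, so either can be used to index the frame. Any speed difference comes from overhead around the comparison, not from the comparison itself.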
Ben
chrisb
  • Your explanation makes sense. But I'm still confused. Typically, the route to optimise `pandas` is to drop down to `numpy` [and then maybe `numba` or `cython`]. Is there a specific reason this is inadvisable specifically for `pd.Timestamp`? – jpp Apr 11 '18 at 00:27
  • That advice, while it often works in practice, is very simplistic. Numpy isn't inherently faster than pandas; rather, pandas often uses numpy internally, so if you know exactly what you want you can elide some overhead. In this case, the array operation is identical either way; only the scalar construction is faster. – chrisb Apr 11 '18 at 15:22
  • How do you convert an entire column into `pd.Timestamp`, since `pd.to_datetime()` returns datetime64? – j7skov Jan 12 '23 at 14:08