
I've recently stumbled upon a new awesome pendulum library for easier work with datetimes.

In pandas, there is this handy to_datetime() function that converts series and other objects to datetimes:

raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')

What would be the canonical way to create a custom to_<something> method - in this case to_pendulum() method which would be able to convert Series of date strings directly to Pendulum objects?

This may lead to Series having various interesting capabilities like, for instance, converting a series of date strings to a series of "offsets from now" - human datetime diffs.

cs95
alecxe
  • Hmm, what do you have in mind? You can subclass `Series` objects, wherein you can add a `to_pendulum` method that does what you want. – cs95 Dec 16 '17 at 19:47
  • [Here's](http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures) the official guide on subclassing series. – cs95 Dec 16 '17 at 19:48
  • @cᴏʟᴅsᴘᴇᴇᴅ I was initially thinking about just calling an `apply()` method, but then I have a very limited knowledge of pandas and was not sure about the most appropriate way to create a custom conversion method like this. Will read the guide, thanks! – alecxe Dec 16 '17 at 19:50
  • Ah, alright. It seems like I misunderstood. So, you have a column of datetimes and you'd like to apply this pendulum diff_for_humans function? (sorry, I'm unfamiliar with this library). If it's as simple as that, you could just define a function and pass it to `pd.Series.apply`, subclassing a Series would just be overkill. – cs95 Dec 16 '17 at 19:55
  • @cᴏʟᴅsᴘᴇᴇᴅ no problem, I was not clear enough. I was thinking of initially converting a column (series) of datetime strings to a column of Pendulum objects so that later on I'd be able to make datetime operations easier - human datetime differences, timezone conversions and other convenient things that pendulum offers. – alecxe Dec 16 '17 at 20:03
  • Actually, I retract my first answer. I don't think converting a series to pendulum objects is a great idea. 1) Pandas does not natively support pendulum 2) You will not see any performance benefits 3) Pandas attempts to coerce anything that it can to a known format, so it keeps trying to cast pendulum objects to Timestamp. This is going to keep biting you. – cs95 Dec 16 '17 at 20:23
  • This is a consequence of the fact that `isinstance(pendulum.now(), datetime.datetime) >>> True`, so pandas special cases this sort of object, coercing it where possible to `Timestamp`. – cs95 Dec 16 '17 at 20:25
  • @cᴏʟᴅsᴘᴇᴇᴅ ah, did not know these things about pandas. Would really appreciate if you could summarize it in an answer. Thanks, learned something new today! – alecxe Dec 16 '17 at 20:38

1 Answer


What would be the canonical way to create a custom to_<something> method - in this case to_pendulum() method which would be able to convert Series of date strings directly to Pendulum objects?

After looking through the API a bit, I must say I'm impressed with what they've done. Unfortunately, I don't think Pendulum and pandas can work together (at least, with the current latest version - v0.21).

The most important reason is that pandas does not natively support Pendulum as a datatype. The natively supported datatypes (np.int64, np.float64 and np.datetime64) all support vectorisation in some form. With Pendulum objects, you are not going to get a shred of performance improvement from using a dataframe over, say, a vanilla loop and list. If anything, calling apply on a Series of Pendulum objects is going to be slower (because of all the API overhead).
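To make the vectorisation point concrete, here is a rough sketch of the two paths. The sample strings and the fixed reference time are made up for illustration, and the "boxed" series just holds ordinary Timestamp objects under dtype=object as a stand-in for any non-native type such as Pendulum.

```python
import pandas as pd

# Hypothetical sample data, in the same shape as the strings used in this answer.
raw = pd.Series(["2017-11-09 18:43:45", "2017-11-10 13:49:09"], name="timestamp")

# Native path: one vectorised parse, stored as a datetime64 column.
native = pd.to_datetime(raw)

# Arithmetic against a fixed reference time runs on the whole column at once.
reference = pd.Timestamp("2017-11-11 00:00:00")
offsets = reference - native  # timedelta64 column

# Object path: the same values boxed as Python objects under dtype=object.
boxed = pd.Series(list(native), dtype=object)

# Every element now goes through a Python-level function call.
slow_offsets = boxed.apply(lambda ts: reference - ts)
```

Both paths produce the same offsets; the difference is that the second one loops in Python, which is where the performance goes on larger columns.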

Another reason is that Pendulum is a subclass of datetime -

from datetime import datetime

import pendulum

isinstance(pendulum.now(), datetime)
True

This is important, because, as mentioned above, datetime is a supported datatype, so pandas will attempt to coerce datetime to pandas' native datetime format - Timestamp. Here's an example.

print(s)

0     2017-11-09 18:43:45
1     2017-11-09 20:15:27
2     2017-11-09 22:29:00
3     2017-11-09 23:42:34
4     2017-11-10 00:09:40
5     2017-11-10 00:23:14
6     2017-11-10 03:32:17
7     2017-11-10 10:59:24
8     2017-11-10 11:12:59
9     2017-11-10 13:49:09

s = s.apply(pendulum.parse)
s

0    2017-11-09 18:43:45+00:00
1    2017-11-09 20:15:27+00:00
2    2017-11-09 22:29:00+00:00
3    2017-11-09 23:42:34+00:00
4    2017-11-10 00:09:40+00:00
5    2017-11-10 00:23:14+00:00
6    2017-11-10 03:32:17+00:00
7    2017-11-10 10:59:24+00:00
8    2017-11-10 11:12:59+00:00
9    2017-11-10 13:49:09+00:00
Name: timestamp, dtype: datetime64[ns, <TimezoneInfo [UTC, GMT, +00:00:00, STD]>]

s[0]
Timestamp('2017-11-09 18:43:45+0000', tz='<TimezoneInfo [UTC, GMT, +00:00:00, STD]>')

type(s[0])
pandas._libs.tslib.Timestamp

So, with some difficulty (involving dtype=object), you could load Pendulum objects into dataframes. Here's how you'd do that -

# dtype=object stops pandas from coercing the parsed values to Timestamp
v = np.vectorize(pendulum.parse)
s = pd.Series(v(s), dtype=object)

s

0     2017-11-09T18:43:45+00:00
1     2017-11-09T20:15:27+00:00
2     2017-11-09T22:29:00+00:00
3     2017-11-09T23:42:34+00:00
4     2017-11-10T00:09:40+00:00
5     2017-11-10T00:23:14+00:00
6     2017-11-10T03:32:17+00:00
7     2017-11-10T10:59:24+00:00
8     2017-11-10T11:12:59+00:00
9     2017-11-10T13:49:09+00:00

s[0]
<Pendulum [2017-11-09T18:43:45+00:00]>

However, this is essentially useless: calling any pendulum method (via apply) will not only be super slow, but the result will also be coerced to Timestamp again. It's an exercise in futility.
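As for the literal question of how to attach a custom conversion method: pandas releases newer than the v0.21 discussed here (0.23 and up) provide pd.api.extensions.register_series_accessor, which adds a namespace to every Series without subclassing. The following is only a sketch under that assumption. The accessor name "boxed" and the method parse_with are hypothetical names chosen for illustration, and datetime.fromisoformat stands in for pendulum.parse so the snippet runs without pendulum installed. All the caveats above about speed and coercion still apply to the result.

```python
from datetime import datetime

import pandas as pd


@pd.api.extensions.register_series_accessor("boxed")  # hypothetical accessor name
class BoxedAccessor:
    """Adds s.boxed.parse_with(parser) to every Series."""

    def __init__(self, series):
        self._series = series

    def parse_with(self, parser):
        # Parse element by element; this is a plain Python loop, not vectorised.
        values = [parser(value) for value in self._series]
        # dtype=object keeps the parsed objects as they are instead of
        # letting pandas convert datetime subclasses to Timestamp.
        return pd.Series(values, index=self._series.index,
                         name=self._series.name, dtype=object)


# Hypothetical sample data; datetime.fromisoformat stands in for pendulum.parse.
s = pd.Series(["2017-11-09 18:43:45", "2017-11-09 20:15:27"], name="timestamp")
converted = s.boxed.parse_with(datetime.fromisoformat)
```

With pendulum installed, the call would presumably look like s.boxed.parse_with(pendulum.parse), giving the same object-dtype Series as the np.vectorize version above.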

cs95
  • For future readers: GitHub issue on making pandas work with pendulum: https://github.com/pandas-dev/pandas/issues/15986 – Hatshepsut Jan 20 '19 at 05:44