6

Since matplotlib doesn't support either pandas.Timestamp or numpy.datetime64, and there are no simple workarounds, I decided to convert a native pandas date column into pure Python datetime.datetime objects so that scatter plots are easier to make.

However:

import pandas as pd

t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})
t.dtypes  # date    datetime64[ns], as expected
pure_python_datetime_array = t.date.dt.to_pydatetime()  # works fine
t['date'] = pure_python_datetime_array  # doesn't do what I hoped
t.dtypes  # date    datetime64[ns] as before, no luck changing it

I'm guessing pandas auto-converts the pure python datetime produced by to_pydatetime into its native format. I guess it's convenient behavior in general, but is there a way to override it?

max
  • I'm having trouble understanding what format you actually want. Do you just want the date? Or also the time? See e.g. http://codrspace.com/szeitlin/biking-data-from-xml-to-plots-part-2/ – szeitlin Sep 01 '16 at 18:18
  • I want the column `date` to have actual `datetime.datetime` objects. The ones that are returned by `to_pydatetime()` function. I don't want `TimeStamp` in that column because matplotlib can't make scatter plots with it. – max Sep 01 '16 at 18:24

3 Answers

4

The use of to_pydatetime() is correct.

In [87]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')]})

In [88]: t.date.dt.to_pydatetime()
Out[88]: 
array([datetime.datetime(2012, 12, 31, 0, 0),
       datetime.datetime(2013, 12, 31, 0, 0)], dtype=object)

When you assign it back to t.date, pandas automatically converts it back to datetime64.

pandas.Timestamp is a datetime subclass anyway :)
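The subclass relationship mentioned above can be checked directly. This is a minimal sketch; the variable names are only for illustration.

```python
import datetime

import pandas as pd

ts = pd.Timestamp("2012-12-31")

# pandas.Timestamp derives from datetime.datetime, so isinstance checks pass.
is_subclass = issubclass(pd.Timestamp, datetime.datetime)
is_instance = isinstance(ts, datetime.datetime)

# A single Timestamp can still be turned into a plain datetime.datetime.
plain = ts.to_pydatetime()

print(is_subclass, is_instance, type(plain))
```

So code that only relies on the datetime.datetime interface will usually accept a Timestamp as well; the trouble described in the question comes from how the column is stored, not from the individual values.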

One way to do the plot is to convert the datetime to int64:

In [117]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')], 'sample_data': [1, 2]})

In [118]: t['date_int'] = t.date.astype(np.int64)

In [119]: t
Out[119]: 
        date  sample_data             date_int
0 2012-12-31            1  1356912000000000000
1 2013-12-31            2  1388448000000000000

In [120]: t.plot(kind='scatter', x='date_int', y='sample_data')
Out[120]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3c852662d0>

In [121]: plt.show()
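For reference, the integers in date_int are nanoseconds since the Unix epoch, which is why the x axis shows raw numbers rather than dates. The sketch below shows the conversion in both directions; the explicit cast to datetime64[ns] is an assumption added here so the integer values do not depend on the pandas version's default resolution.

```python
import pandas as pd

# Force nanosecond resolution so the integer values are predictable.
dates = pd.Series(pd.to_datetime(["2012-12-31", "2013-12-31"])).astype("datetime64[ns]")

# datetime64[ns] -> int64 gives nanoseconds since 1970-01-01.
date_int = dates.astype("int64")

# The integers can be turned back into datetimes when labels are needed.
restored = pd.to_datetime(date_int, unit="ns")

print(date_int.tolist())
print(restored.tolist())
```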


Another workaround is to skip kind='scatter' and use a regular plot with a point marker style instead:

In [126]: t.plot(x='date', y='sample_data', style='.')
Out[126]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3c850f5750>

And the last workaround is to call matplotlib's scatter directly:

In [141]: import matplotlib.pyplot as plt

In [142]: t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31'), pd.to_datetime('2013-12-31')], 'sample_data': [100, 20000]})

In [143]: t
Out[143]: 
        date  sample_data
0 2012-12-31          100
1 2013-12-31        20000
In [144]: plt.scatter(t.date.dt.to_pydatetime()  , t.sample_data)
Out[144]: <matplotlib.collections.PathCollection at 0x7f3c84a10510>

In [145]: plt.show()


There is an issue about this on GitHub, which is still open as of this writing.

Nehal J Wani
  • yes, I think the problem is that `t.date` is not the recommended way to reference a column named 'date'. It's a lot clearer to do `t['date']`. – szeitlin Sep 01 '16 at 18:21
  • I need the column to contain the pure python datetime.datetime objects. That way, the call to `df.plot(kind='scatter', ...)` will not fail. – max Sep 01 '16 at 18:23
  • @max Instead of `df.plot`, you can call `matplotlib.pyplot.scatter` directly. I have updated the answer. – Nehal J Wani Sep 01 '16 at 18:45
  • So there's no way to force pandas to store pure python `datetime.datetime` objects inside a `DataFrame`? Because that would have been the easiest workaround in my case, where the dataset isn't so big that I care about performance. – max Sep 01 '16 at 20:41
  • only the last workaround maintains the correct x axis labels and doesn't depend on the distance between x values being the same. i guess it's not that much extra typing, so probably the best (except won't work with `groupby`). – max Sep 01 '16 at 22:01
2

Here is a possible solution with the Series class from pandas:

t = pd.DataFrame({'date': [pd.to_datetime('2012-12-31')]})
t.dtypes # date    datetime64[ns], as expected
pure_python_datetime_array = t.date.dt.to_pydatetime() # works fine
t['date'] = pd.Series(pure_python_datetime_array, dtype=object) # should do what you expect
t.dtypes # date    object; the elements of the date column are now datetime.datetime
type(t.values[0, 0]) # datetime, now you can access the datetime object directly

Why does this work? My assumption is that you force the dtype of the date column to be object, so pandas does not do any internal conversion from datetime.datetime to datetime64.

Correct me if I am wrong.
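A self-contained version of the same idea, as a sketch: it builds the list element by element with Timestamp.to_pydatetime() instead of calling .dt.to_pydatetime() on the column. That substitution is an assumption made here so the snippet does not depend on the return type of the Series accessor; the object-dtype Series is the part taken from the answer.

```python
import datetime

import pandas as pd

t = pd.DataFrame({"date": [pd.to_datetime("2012-12-31")]})

# Convert each Timestamp to a plain datetime.datetime.
pure_python = [ts.to_pydatetime() for ts in t["date"]]

# dtype=object stops pandas from inferring datetime64 on assignment.
t["date"] = pd.Series(pure_python, dtype=object, index=t.index)

print(t.dtypes)
print(type(t["date"].iloc[0]))
```

Keep in mind that an object column gives up the vectorized .dt accessor, so this is mainly useful when plain datetime.datetime objects are really required.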

PiMathCLanguage
0

For me, the steps look like this:

  1. convert timezone with pytz
  2. convert to_datetime with pandas and make that the index
  3. plot and autoformat

Starting df looks like this:

before converting timestamps

  1. import pytz

    ts['posTime'] = [x.astimezone(pytz.timezone('US/Pacific')) for x in ts['posTime']]

I can see that it worked because the timestamps changed format:

after timezone conversion

  2. sample['posTime'] = pandas.to_datetime(sample['posTime'])

    sample.index = sample['posTime']

At this point, just plotting with pandas (which uses matplotlib under the hood) gives me a nice rotation and totally the wrong format:

after pandas datetime conversion

  3. However, there's nothing wrong with the format of the objects. I can now make a scatterplot with matplotlib and it autoformats the datetimes as you'd expect.

    plt.scatter(sample['posTime'].values, sample['Altitude'].values)

    fig = plt.gcf()

    fig.set_size_inches(9.5, 3.5)

formatted

  4. If you use the auto format method, you can zoom in and it will continue to automatically choose the appropriate format (but you still have to set the scale manually).

autoformatted
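The plotting step above can be sketched end to end as follows. The data values are made up for illustration, the Agg backend is chosen only so the snippet runs without a display, and fig.autofmt_xdate() is an assumption about which "auto format method" is meant; the column names posTime and Altitude come from the answer.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for running without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data standing in for the answer's DataFrame.
sample = pd.DataFrame({
    "posTime": pd.to_datetime(["2016-09-01 10:00", "2016-09-01 10:05", "2016-09-01 10:10"]),
    "Altitude": [120.0, 135.5, 128.2],
})

# Plain datetime.datetime values for the x axis.
x = [ts.to_pydatetime() for ts in sample["posTime"]]

fig, ax = plt.subplots(figsize=(9.5, 3.5))
points = ax.scatter(x, sample["Altitude"])

# Rotate and align the date tick labels.
fig.autofmt_xdate()
fig.savefig("altitude_scatter.png")
```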

szeitlin