Store and extract numpy datetimes in PyTables

Question

I want to store numpy datetime64 data in a PyTables Table. I want to do this without using Pandas.

What I've tried so far

Setup

In [1]: import tables as tb
In [2]: import numpy as np
In [3]: from datetime import datetime

create data

In [4]: data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]
In [5]: rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])
In [6]: rec  # a numpy array with my data
Out[6]: 
array([(1, datetime.datetime(2000, 1, 1, 1, 1, 1)),
       (2, datetime.datetime(2001, 2, 2, 2, 2, 2))], 
      dtype=[('a', '<i4'), ('b', '<M8[us]')])

Open PyTables dataset with `Time64Col` descriptor

In [7]: f = tb.open_file('foo.h5', 'w')  # New PyTables file
In [8]: d = f.create_table('/', 'bar', description={'a': tb.Int32Col(pos=0), 
                                                    'b': tb.Time64Col(pos=1)})
In [9]: d
Out[9]: 
/bar (Table(0,)) ''
  description := {
  "a": Int32Col(shape=(), dflt=0, pos=0),
  "b": Time64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)

Append NumPy data to PyTables dataset

In [10]: d.append(rec)
In [11]: d
Out[11]: 
/bar (Table(2,)) ''
  description := {
  "a": Int32Col(shape=(), dflt=0, pos=0),
  "b": Time64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)

What happened to my datetimes?

In [12]: d[:]
Out[12]: 
array([(1, 0.0), (2, 0.0)], 
      dtype=[('a', '<i4'), ('b', '<f8')])

I understand that HDF5 doesn't provide native support for datetimes. I would expect that the extra metadata that PyTables overlays would handle this though.

My Question

How can I store a numpy record array that contains datetimes in PyTables? How can I efficiently extract that data from a PyTables table back to a NumPy array and retain my datetimes?

Common answer

I commonly get this answer:

Use Pandas

I don't want to use Pandas because I don't have an index, I don't want one stored in my dataset, and Pandas doesn't allow you to not have/store an index (see this question)

score 5 · Answer 1 · edited May 23 '17 at 12:15

First, when putting the values into a Time64Col, they need to be float64s. You can do this with a call to astype, like so:

new_rec = rec.astype([('a', 'i4'), ('b', 'f8')])

Then you need to convert column b to seconds since the epoch which means you need to divide by 1,000,000 since we're in microseconds:

new_rec['b'] = new_rec['b'] / 1e6

Then call d.append(new_rec)

When you read the array back into memory, do the reverse and multiply by 1,000,000. You'll have to make sure things are in microseconds before putting anything in, which is handled automatically by astype('datetime64[us]') in numpy >= 1.7.x

I used the solution from this question: How to get unix timestamp from numpy.datetime64

Here's a working version of your example:

In [4]: data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]

In [5]: rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])

In [6]: new_rec = rec.astype([('a', 'i4'), ('b', 'f8')])

In [7]: new_rec
Out[7]:
array([(1, 946688461000000.0), (2, 981079322000000.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [8]: new_rec['b'] /= 1e6

In [9]: new_rec
Out[9]:
array([(1, 946688461.0), (2, 981079322.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [10]: f = tb.open_file('foo.h5', 'w')  # New PyTables file

In [11]: d = f.create_table('/', 'bar', description={'a': tb.Int32Col(pos=0),
   ....:                                             'b': tb.Time64Col(pos=1)})

In [12]: d.append(new_rec)

In [13]: d[:]
Out[13]:
array([(1, 946688461.0), (2, 981079322.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [14]: r = d[:]

In [15]: r['b'] *= 1e6

In [16]: r.astype([('a', 'i4'), ('b', 'datetime64[us]')])
Out[16]:
array([(1, datetime.datetime(2000, 1, 1, 1, 1, 1)),
       (2, datetime.datetime(2001, 2, 2, 2, 2, 2))],
      dtype=[('a', '<i4'), ('b', '<M8[us]')])

now I remember why pandas doesn't use TimeCol64 because is backed by float64 which can't support nanosecond precision — Jeff, Sep 07 '14 at 03:01
Yep. The amount of work needed to round trip a `datetime64` dwarfs any benefit provided by giving the `Time64Col` metadata to PyTables. — Phillip Cloud, Sep 07 '14 at 03:03