I have a list of np.datetime64 data that looks as follows:

times =[2015-03-26T16:02:42.000000Z,
 2015-03-26T16:02:45.000000Z,...]

type(times) returns list

type(times[1]) returns obspy.core.utcdatetime.UTCDateTime

Now, I understand that h5py does not support date time data.

I have tried the following:

time_str = [n.encode("ascii", "ignore") for n in time_str]
time_str = [str(s) for s in time_str]

type(time_str[1]) returns bytes

I am okay with creating the dataset and storing these date time values as strings.

However, when attempting to create the dataset, I get the following error:

with h5py.File('data_ML.hdf5', 'w') as f:
           f.create_dataset("time", data=time_str,maxshape=(None),chunks=True, dtype='str')

TypeError: No conversion path for dtype: dtype('<U')

Where am I going wrong? Is there an alternative way to store these values as is, so I can extract them later?

  • What does `type(time_str[1])` give you? – JeffUK Jul 19 '21 at 16:54
  • it gives me: bytes – Steve Bermeo Jul 19 '21 at 16:57
  • I think you need to make it String, not Bytes, as your dtype='str' – JeffUK Jul 19 '21 at 16:59
  • Would it be in the n.encode? I thought I was converting them to strings already – Steve Bermeo Jul 19 '21 at 17:03
  • h5py doesn't support NumPy's Unicode dtype. You have to use string dtype. So, if you have `U20`, use `S20` instead. Read this: [h5py: What about NumPy’s U type?](https://docs.h5py.org/en/stable/strings.html#what-about-numpy-s-u-type). SO has several questions and answers on this subject. This is a recent one: https://stackoverflow.com/a/68199414/10462884 -- note: it needs a better title, it's not really a broadcast error. – kcw78 Jul 19 '21 at 17:30
  • Yes, I read about the numpy unicode. I am just very confused since I never dealt with h5py and conversions like this are very unclear to me – Steve Bermeo Jul 19 '21 at 18:22

1 Answer

Ok, here we go. I couldn't get some of your code to work together (maybe you left some steps out, or changed variable names?). Also, I could not reproduce the obspy.core.utcdatetime.UTCDateTime objects you have.

So I created an example that does the following:

  1. Starts with a list of np.datetime64() objects.
  2. Converts them to a list of UTC-format strings with np.datetime_as_string() (see the note in Item 4).
  3. Converts that list to a np.array with dtype='S30'.
  4. Note: I included Step 2 to replicate your data. See the following section for a simpler version.

Code below:

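import h5py
import numpy as np

times =[np.datetime64('2015-03-26T16:02:42.000000'),
        np.datetime64('2015-03-26T16:02:45.000000'),
        np.datetime64('2015-03-26T16:02:48.000000'),
        np.datetime64('2015-03-26T16:02:55.000000') ]

# Format each value as an ISO 8601 string with a trailing 'Z' (UTC)
utc_times = [ np.datetime_as_string(n,timezone='UTC') for n in times ]

# Convert to fixed-length byte strings ('S30'), which h5py can store
utc_str_arr = np.array(utc_times,dtype='S30')

with h5py.File('data_ML.hdf5', 'w') as f:
     # maxshape must be a tuple: (None) is just None, so use (None,) for a resizable dataset
     f.create_dataset("time", data=utc_str_arr,maxshape=(None,),chunks=True)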
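The error in the question comes from handing h5py an array with NumPy's Unicode ('U') dtype, which is what NumPy builds by default from a list of Python strings. Here is a minimal sketch, using only NumPy, of the conversion to a byte-string ('S') dtype and back. The variable names are just for illustration, and it assumes the strings are plain ASCII, which holds for ISO-formatted timestamps.

```python
import numpy as np

iso_strings = ['2015-03-26T16:02:42.000000Z',
               '2015-03-26T16:02:45.000000Z']

# Default conversion gives a Unicode dtype, which h5py rejects
unicode_arr = np.array(iso_strings)

# Cast to fixed-length byte strings; NumPy picks the length from the data
bytes_arr = unicode_arr.astype('S')

# Casting back to 'U' recovers the original strings
decoded = bytes_arr.astype('U')

print(unicode_arr.dtype, bytes_arr.dtype, decoded.dtype)
```

Using an explicit width such as 'S30' (as in the code above) instead of 'S' simply leaves a little headroom for longer strings.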

You can simplify the process if you are starting with np.datetime64() objects, and don't have (and don't need or want) the intermediate list of string objects (variable utc_times in my code). The method below skips Step 2 above, and shows 2 ways to create a np.array() of properly encoded strings.

Code below:

import h5py
import numpy as np

times =[np.datetime64('2015-03-26T16:02:42.000000'),
        np.datetime64('2015-03-26T16:02:45.000000'),
        np.datetime64('2015-03-26T16:02:48.000000'),
        np.datetime64('2015-03-26T16:02:55.000000') ]

# Create empty array with defined size and 'S#' dtype, then populate with for loop:
utc_str_arr1 = np.empty((len(times),),dtype='S30')
for i, n in enumerate(times):
    utc_str_arr1[i] = np.datetime_as_string(n,timezone='UTC')

# -OR- Create array and populate using list comprehension:
utc_str_arr2 = np.array( [np.datetime_as_string(n,timezone='UTC').encode('utf-8') for n in times] )

# Note: mode 'w' overwrites any existing data_ML.hdf5
with h5py.File('data_ML.hdf5', 'w') as f:
     # maxshape must be a tuple: (None) is just None, so use (None,) for a resizable dataset
     f.create_dataset("time1", data=utc_str_arr1,maxshape=(None,),chunks=True)
     f.create_dataset("time2", data=utc_str_arr2,maxshape=(None,),chunks=True)

The final result looks the same with either method (the second example creates 2 identical datasets).
Image from HDFView:
https://stackoverflow.com/a/46921593/10462884
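Since the datasets are created with maxshape and chunks, presumably so more timestamps can be added later, here is a rough sketch of appending to such a dataset. The scratch file name and the sample values are made up for the example, and it assumes the dataset was created with maxshape=(None,) (a one-element tuple), because a dataset created without a maxshape tuple cannot be resized.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file so the sketch does not touch data_ML.hdf5
path = os.path.join(tempfile.mkdtemp(), 'resize_demo.hdf5')

first = np.array([b'2015-03-26T16:02:42.000000Z',
                  b'2015-03-26T16:02:45.000000Z'], dtype='S30')
extra = np.array([b'2015-03-26T16:02:48.000000Z'], dtype='S30')

# Create a resizable 1-D dataset of fixed-length byte strings
with h5py.File(path, 'w') as f:
    f.create_dataset('time', data=first, maxshape=(None,), chunks=True)

# Reopen in append mode, grow the dataset, and write the new values at the end
with h5py.File(path, 'a') as f:
    ds = f['time']
    old_len = ds.shape[0]
    ds.resize((old_len + len(extra),))
    ds[old_len:] = extra
    final_shape = ds.shape
    last_value = ds[old_len]

print(final_shape, last_value)
```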

To Read the Data:
Per the request in the Aug 02 2021 comment, here is the code to extract the data from HDF5 and create Pandas timestamp objects (which are then saved to a dataframe). First, the byte strings in the dataset are read and converted to NumPy Unicode strings with .astype(). Then the strings are converted to Pandas timestamp objects with pd.to_datetime() using the format= parameter.

import h5py
import numpy as np
import pandas as pd

with h5py.File('data_ML.hdf5', 'r') as h5f:
    ## returns a h5py dataset object
    ## ("time" is the dataset written by the first example;
    ##  use "time1" or "time2" if the file came from the second one):
    dts_ds = h5f["time"]
    ## length of the longest stored string, used to size the unicode dtype
    longest_word=len(max(dts_ds, key=len))
    
    ## dts_ds[:] returns an array of byte strings representing np.datetime64;
    ## .astype() is used to convert the byte strings to unicode
    dts_arr = dts_ds[:].astype('U'+str(longest_word))

    ## create a new array to hold Pandas datetime objects
    ## then loop over first array to convert and populate new array
    pd_dts_arr = np.empty((dts_arr.shape[0],),dtype=object)
    for i, dts in enumerate(dts_arr):    
        pd_dts_arr[i] = pd.to_datetime(dts, format='%Y-%m-%dT%H:%M:%S.%fZ')
        
    dts_df = pd.DataFrame(pd_dts_arr)
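If np.datetime64 values are wanted instead of Pandas timestamps, the same idea can be sketched with NumPy alone. This is only an illustration: the stored array below stands in for what dts_ds[:] would return, and the trailing 'Z' is stripped because np.datetime64 is timezone-naive and newer NumPy versions complain when parsing a timezone suffix.

```python
import numpy as np

# Stand-in for the byte strings read back from the HDF5 dataset (dtype 'S30')
stored = np.array([b'2015-03-26T16:02:42.000000Z',
                   b'2015-03-26T16:02:45.000000Z'], dtype='S30')

# Decode each byte string and drop the trailing 'Z' before parsing
iso_strings = [s.decode('ascii').rstrip('Z') for s in stored]

# Parse the ISO strings into microsecond-resolution datetime64 values
restored = np.array(iso_strings, dtype='datetime64[us]')

print(restored)
```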

There are a lot of ways to represent dates and time using native Python, NumPy and Pandas objects. More details about working with them can be found at this answer: Converting between datetime, Timestamp and datetime64

kcw78
  • THANK YOU! This solved the issue beautifully – Steve Bermeo Jul 21 '21 at 19:05
  • PS... For opening the hdf file and then comparing it back to a pandas timestamp, what would the approach be? – Steve Bermeo Aug 02 '21 at 17:01
  • @Steve Bermeo...that's a different (new) question. :-) Your example starts with a list of `np.datetime64` data. I am not familiar with Pandas timestamp objects. Do they use `np.datetime64` to save time/date data? If so, you simply reverse the steps above -- 1) open the file in read mode, 2) read the `time` dataset (as an array of strings), 3) then convert the strings back to `np.datetime64` objects. – kcw78 Aug 02 '21 at 20:40
  • Hi Steve. Challenge accepted. I added to my previous answer showing how to read the data and convert to a Pandas timestamp object (at least, I think that's the object you want). Check out the linked answer at the end. It covers just about every date/time object conversion you might want to do. The trick in my code is reading the HDF5 byte strings and converting to a format you can use with one of those methods. – kcw78 Aug 03 '21 at 00:25