
I created a class to hold experiment results from my research (I'm an EE PhD student), like this:

class Trial:
    def __init__(self, subID, triID):
        self.filePath = ''    # file path of the folder
        self.subID = subID    # int
        self.triID = triID    # int
        self.data_A = None    # numpy array
        self.data_B = None    # numpy array
        ......

It's a mix of many bools, ints, and numpy arrays. You get the idea. I read that loading is faster when the data is in HDF5 format. Can I do that with my data, which is a Python list of my Trial objects?

Note that there is a similar question on Stack Overflow, but it only has one answer, which doesn't actually answer the question. Instead, it breaks the OP's custom class down into basic data types and stores them in individual datasets. I'm not against doing that, but I want to know if it's the only way, because it goes against the philosophy of object-oriented programming.

McAngus
F.S.
`pickle` is probably the easiest way of saving your own class. It is designed around Python objects. It saves numpy arrays the same way as `np.save`. `h5py` writes `numpy` arrays (plus strings and scalars) to `hdf5`. That's what the link is doing. You'd have to write your own `save` method that saves the class attributes. `pandas` uses another interface, `pytables`, but that still ends up writing arrays and strings, not 'objects' – hpaulj Jan 17 '18 at 05:22
HDF5 is a generic container and doesn't support storing Python objects directly. If you want to use that format, you'll need to break your object into different fields. Have you looked at Python's pickle? That's a very simple method for storing objects, but it isn't portable across languages. You could also serialize this to JSON fairly easily if the data isn't too big. – bivouac0 Jan 17 '18 at 05:25
@hpaulj I just looked at some introductions to `pickle` and I see your point. But does it increase the loading speed (which is why I wanted to use hdf5)? Currently my program reads a bunch of txt files every time it runs. Each file contains a table that will be loaded into a numpy array. – F.S. Jan 17 '18 at 06:52
  • @bivouac0 thank you, I didn't know about pickle or JSON. I'm curious if they save loading time, compared to loading text files to create my objects each time I run the program (see my message above) – F.S. Jan 17 '18 at 06:56
  • How are you loading the text files? Or saving them? – hpaulj Jan 17 '18 at 07:14
I would start with using `cPickle`. Parsing a bunch of individual text files takes time, whereas `cPickle` is a single stream read/write to disk. `json` is faster for dictionary objects but it's probably not a great choice for a bunch of floating-point data (note that `ujson` is even better, but again for dictionaries, not large numpy arrays). – bivouac0 Jan 17 '18 at 11:30
  • @hpaulj I load them using `np.loadtxt`. For example `self.eegFile = np.loadtxt(self.filePath + self.date + '_eeg_' + self.fileID + '.txt', skiprows=1)` – F.S. Jan 17 '18 at 17:11
  • What's the big deal then? You are already saving/loading the attributes of your object to files as arrays. `np.save/load` is a faster array format. `np.savez` saves multiple arrays in a `zip` archive. If you don't need to share the data with other programs, the native `np.save` is probably faster than `h5py`. – hpaulj Jan 17 '18 at 17:35
  • @hpaulj that's probably the information I'm looking for. I guess I was under the wrong impression that hdf5 has faster loading time compared to native methods – F.S. Jan 19 '18 at 04:49
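The `np.save`/`np.savez` approach suggested in the comments can be sketched as follows; the file names here are illustrative, and the dummy table just makes the example self-contained:

```python
import numpy as np

# One-time setup: write a small dummy text table (stand-in for the OP's
# real data files) so this sketch is self-contained.
np.savetxt('trial_eeg.txt', np.arange(12.0).reshape(4, 3),
           header='ch1 ch2 ch3')

# Parse the text table once (slow), then cache it in numpy's binary format.
table = np.loadtxt('trial_eeg.txt', skiprows=1)
np.save('trial_eeg.npy', table)

# Later runs skip text parsing entirely and load the binary file (fast).
table2 = np.load('trial_eeg.npy')
assert np.array_equal(table, table2)

# np.savez bundles several named arrays into one .npz archive.
np.savez('trial.npz', data_A=table, data_B=table * 2)
with np.load('trial.npz') as arrays:
    print(arrays['data_A'].shape)  # (4, 3)
```

The idea is to pay the `np.loadtxt` parsing cost once, converting each text table to `.npy`/`.npz`, and load the binary files on every subsequent run.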

2 Answers


Here's a small class that I use for saving data like this. You can use it by doing something like:

dc = DataContainer()
dc.trials = <your list of trial objects here>
dc.save('mydata.pkl')

Then to load:

dc = DataContainer.load('mydata.pkl')

Here's the DataContainer file:

import gzip
import pickle  # Python 3; on Python 2, use cPickle for speed

# Simple container with load and save methods.  Declare the container
# then add data to it.  Save will save any data added to the container.
# The class automatically gzips the file if it ends in .gz
#
# Notes on size and speed (using UbuntuDialog data)
#       pkl     pkl.gz
# Save  11.4s   83.7s
# Load   4.8s   45.0s
# Size  596M    205M
#
class DataContainer(object):
    @staticmethod
    def isGZIP(filename):
        # Treat any file ending in .gz as gzip-compressed
        return filename.endswith('.gz')

    # Using HIGHEST_PROTOCOL is almost 2X faster and creates a file that
    # is ~10% smaller.  Load times go down by a factor of about 3X.
    def save(self, filename='DataContainer.pkl'):
        if self.isGZIP(filename):
            f = gzip.open(filename, 'wb')
        else:
            f = open(filename, 'wb')
        pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
        f.close()

    # Note that loading to a string with pickle.loads is about 10% faster
    # but probably consumes a lot more memory, so we'll skip that for now.
    @classmethod
    def load(cls, filename='DataContainer.pkl'):
        if cls.isGZIP(filename):
            f = gzip.open(filename, 'rb')
        else:
            f = open(filename, 'rb')
        n = pickle.load(f)
        f.close()
        return n

Depending on your use case, you could use this as described at the top, use it as a base class, or simply copy the `pickle.dump` line into your code.
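The "copy the `pickle.dump` line" option amounts to pickling the whole list directly, with no container class at all. A minimal sketch, using a stand-in `Trial` class (the real one is in the question):

```python
import pickle

# Illustrative stand-in for the question's Trial class
class Trial:
    def __init__(self, subID, triID):
        self.subID = subID
        self.triID = triID

trials = [Trial(1, t) for t in range(3)]

# Dump the entire list of objects in one shot
with open('trials.pkl', 'wb') as f:
    pickle.dump(trials, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back; pickle reconstructs the Trial instances
with open('trials.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(len(loaded), loaded[2].triID)  # 3 2
```

This keeps the object-oriented structure intact, since pickle serializes the objects themselves rather than breaking them into basic fields.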

If you really have a lot of data and you don't use all of it with every run of your test program, there are a few other options, such as a database, but the above is about the best simple option, assuming you need most of the data with each run.

bivouac0
  • Thank you. It seems like I wasn't asking the right question. I don't necessarily need to save any result. I'm looking for methods that read txt files (which contains big tables) faster than `np.loadtxt`. Do you think this cpickle method is faster? – F.S. Jan 19 '18 at 04:53
`pickle` won't read text files. It is its own format. Basic Python file ops and numpy are about the only options for reading text files. If you're the one saving the data and you're wondering what format to use, try the code above and see which is quicker. I haven't done a comparison. – bivouac0 Jan 19 '18 at 05:23
  • yeah, I can save my data into this format and read from here instead of txt files. I will give it a try and keep you posted – F.S. Jan 19 '18 at 05:26
3

I have not tested the speed and storage efficiency of the following solution. HDF5 does support 'compound datatypes' that can be used with numpy 'structured arrays', which support mixed variable types such as those encountered in your class object.

"""
Created on Tue Dec 10 21:26:54 2019

@author: Christopher J. Burke
Give a worked example of saving a list of class objects with mixed
storage types to a HDF5 file and reading in file back to a list of class
objects.  The solution is inspired by this bug report
https://github.com/h5py/h5py/issues/735
and the numpy and hdf5 documentation
"""

import numpy as np
import h5py

class test_object:
    """ Define a storage class that keeps info that we want to record
      for every object
    """
    # explicitly state the name, datatype and shape for every
    #  class variable
    #  The names MUST exactly match the class variable names in the __init__
    store_names = ['a', 'b', 'c', 'd', 'e']
    store_types = ['i8', 'i4', 'f8', 'S80', 'f8']
    store_shapes = [None, None, None, None, [4]]
    # Make the tuples that will define the numpy structured array
    # https://docs.scipy.org/doc/numpy/user/basics.rec.html
    sz = len(store_names)
    store_def_tuples = []
    for i in range(sz):
        if store_shapes[i] is not None:
            store_def_tuples.append((store_names[i], store_types[i], store_shapes[i]))
        else:
            store_def_tuples.append((store_names[i], store_types[i]))
    # Actually define the numpy structured/compound data type
    store_struct_numpy_dtype = np.dtype(store_def_tuples)

    def __init__(self):
        self.a = 0
        self.b = 0
        self.c = 0.0
        self.d = '0'
        self.e = [0.0, 0.0, 0.0, 0.0]

    def store_objlist_as_hd5f(self, objlist, fileName):
        """Function to save the class structure into hdf5
        objlist -  is a list of the test_objects
        fileName - is the h5 filename for output
        """        
        # First create the array of numpy structered arrays
        np_dset = np.ndarray(len(objlist), dtype=self.store_struct_numpy_dtype)
        # Convert the class variables into the numpy structured dtype
        for i, curobj in enumerate(objlist):
            for j in range(len(self.store_names)):
                np_dset[i][self.store_names[j]] = getattr(curobj, self.store_names[j])
        # Data set should be all loaded ready to write out
        fp = h5py.File(fileName, 'w')
        hf_dset = fp.create_dataset('dset', shape=(len(objlist),), dtype=self.store_struct_numpy_dtype)
        hf_dset[:] = np_dset
        fp.close()

    def fill_objlist_from_hd5f(self, fileName):
        """ Function to read in the hdf5 file created by store_objlist_as_hdf5
          and store the contents into a list of test_objects
          fileName - si the h5 filename for input
         """
        fp = h5py.File(fileName, 'r')
        np_dset = np.array(fp['dset'])
        # Start with empty list
        all_objs = []
        # iterate through the numpy structured array and save to objects
        for i in range(len(np_dset)):
            tmp = test_object()
            for j in range(len(self.store_names)):
                setattr(tmp, self.store_names[j], np_dset[i][self.store_names[j]])
            # Append object to list
            all_objs.append(tmp)
        return all_objs

if __name__ == '__main__':

    all_objs = []    
    for i in range(3):
        # instantiate a test_object
        tmp = test_object()
        # Put in some dummy data into object
        tmp.a = int(i)
        tmp.b = int(i)
        tmp.c = float(i)
        tmp.d = '{0} {0} {0} {0}'.format(i)
        tmp.e = np.full([4], i, dtype=float)
        all_objs.append(tmp)

    # Write out hd5 file
    tmp.store_objlist_as_hd5f(all_objs, 'test_write.h5')

    # Read in hd5 file
    all_objs = []
    all_objs = tmp.fill_objlist_from_hd5f('test_write.h5')

    # verify the output is as expected
    for i, curobj in enumerate(all_objs):
        print('Object {0:d}'.format(i))
        print('{0:d} {1:d} {2:f}'.format(curobj.a, curobj.b, curobj.c))
        print('{0} {1}'.format(curobj.d.decode('ASCII'), curobj.e))

C_J_Burke
Thanks for answering my question from two years ago! This seems to be your first answer on Stack Overflow. Have fun! – F.S. Dec 11 '19 at 19:30
  • This is great and almost exactly what I'm looking for! However, in my real case my `test_object.__init__` takes arguments. So when I get to `fill_objlist_from_hdf5` the `tmp = test_object()` fails because I don't know what value to provide, not having read the object yet! Any advice? – evanb Nov 06 '22 at 16:26