
I have a class like this:

class C:
    def __init__(self, id, user_id, photo):
        self.id = id
        self.user_id = user_id
        self.photo = photo

I need to create millions of these objects. id and user_id are integers, while photo is a bool array of size 64. My boss wants me to store all of them in HDF5 files. I also need to be able to query by the user_id attribute to get all of the photos that share the same user_id. Firstly, how do I store them? Or can I even? And secondly, once I store them (if I can), how do I query them? Thank you.

A Ef
1 Answer


Although you can store the whole data structure in a single HDF5 table, it is probably much easier to store the described class as three separate variables - two 1D arrays of integers and a data structure for storing your 'photo' attribute.

If you care about file size and speed and do not care about human-readability of your files, you can model your 64 bool values either as 8 separate 1D arrays of UINT8 or as a single 2D array of shape N x 8 of UINT8 (or chars). Then you can implement a simple interface that packs your bool values into the bits of a UINT8 and back (e.g., How to convert a boolean array to an int array), as in the sketch below.
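A minimal sketch of such a pack/unpack interface using NumPy (the function names are just illustrative):

import numpy as np

def pack_photo(photo_bools):
    # 64 bools -> 8 uint8 values, one bit per bool
    return np.packbits(np.asarray(photo_bools, dtype=bool))

def unpack_photo(photo_bytes):
    # 8 uint8 values -> 64 bools
    return np.unpackbits(np.asarray(photo_bytes, dtype='uint8')).astype(bool)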

As far as I know, there are no built-in search functions in HDF5, but you can read in the variable containing the user_ids and then simply use Python (NumPy) to find the indexes of all elements matching your user_id.
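For instance, assuming the file and dataset names from the example code below:

import numpy as np
import h5py

with h5py.File("mytestfile.h5", "r") as f:
    user_ids = f["user_id"][:]            # load the 1D user_id dataset into memory
    idx = np.where(user_ids == 53)[0]     # indexes of all records with user_id == 53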

Once you have the indexes, you can read in the relevant slices of your other variables. HDF5 natively supports efficient slicing, but it works on ranges, so you might want to think about how to store records with the same user_id in contiguous chunks; see the discussion here:

h5py: Correct way to slice array datasets
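A rough sketch of why a contiguous layout helps: if the records are kept sorted by user_id, all matches form a single run of indexes and can be fetched with one range read (file and dataset names are again taken from the example below):

import numpy as np
import h5py

with h5py.File("mytestfile.h5", "r") as f:
    user_ids = f["user_id"][:]
    idx = np.where(user_ids == 53)[0]
    if len(idx):
        # if all records for this user_id sit next to each other,
        # one contiguous range read fetches every matching photo
        photos = f["photo"][idx[0]:idx[-1] + 1]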

You might also want to look into pytables - a Python interface built on top of HDF5 that stores data in table-like structures and supports querying by column.
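A minimal PyTables sketch of the same idea (the table layout and file name here are illustrative assumptions, not part of the h5py example below):

import numpy as np
import tables

class Record(tables.IsDescription):
    id = tables.Int64Col()
    user_id = tables.Int64Col()
    photo = tables.UInt8Col(shape=(8,))   # 64 bools packed into 8 bytes

with tables.open_file("records.h5", mode="w") as h5:
    table = h5.create_table("/", "records", Record)
    row = table.row
    row["id"] = 1
    row["user_id"] = 3
    row["photo"] = np.packbits(np.ones(64, dtype=bool))
    row.append()
    table.flush()

with tables.open_file("records.h5", mode="r") as h5:
    matches = h5.root.records.read_where("user_id == 3")   # in-kernel query by user_id

The complete h5py example below shows the write/query round trip described above.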

import numpy as np
import h5py


class C:
    def __init__(self, id, user_id, photo):
        self.id = id
        self.user_id = user_id
        self.photo = photo


def write_records(records, file_out):
    f = h5py.File(file_out, "w")

    # Pre-allocate room for up to 1,000,000 records; each photo is stored
    # as 8 uint8 bytes (64 bools packed into bits).
    dset_id = f.create_dataset("id", (1000000,), dtype='i')
    dset_user_id = f.create_dataset("user_id", (1000000,), dtype='i')
    dset_photo = f.create_dataset("photo", (1000000, 8), dtype='uint8')

    dset_id[0:len(records)] = [r.id for r in records]
    dset_user_id[0:len(records)] = [r.user_id for r in records]
    dset_photo[0:len(records)] = [np.packbits(np.array(r.photo, dtype=bool)) for r in records]
    f.close()


def read_records_by_id(file_in, record_id):
    f = h5py.File(file_in, "r")
    # Read the whole id dataset and find the indexes of matching records
    data = f["id"][:]
    res = []
    for idx in np.where(data == record_id)[0]:
        photo = np.unpackbits(np.array(f["photo"][idx], dtype='uint8')).astype(bool)
        record = C(f["id"][idx], f["user_id"][idx], photo)
        res.append(record)
    f.close()
    return res


m = [True, False, True, True, False, True, True, True]
m = m + m + m + m + m + m + m + m
records = [C(1, 3, m), C(34, 53, m)]

# Write records to file
write_records(records, "mytestfile.h5")

# Read record from file
res = read_records_by_id("mytestfile.h5", 34)

print(res[0].id)
print(res[0].user_id)
print(res[0].photo)
Maksym
  • Thank you very much for your help. Although the info you gave is very important, there is no way I can store my data in continuous chunks due to the writing actions we'll have. I guess I shall have to use mongodb for storage. – A Ef Aug 25 '16 at 08:52
  • You should really try to fill an HDF5 file and see if the performance is acceptable. For your type of data, it might indeed make sense to use mongo, just remember there is a 16MB cap on what you can store in a single collection. – Maksym Aug 25 '16 at 11:06
  • @Maksym I have similar use case but in place of photo attribute I have numpy ndarray of shape(1024, 768, 3) and that is variable, I care about speed and time but do not care about readability, what do you suggest? – Avoid Apr 15 '18 at 09:33
  • @Avoid Are you limited by read time, write time, or query time? If you have a constant stream of a lot of data, it probably makes sense to look into parquet (http://parquet.apache.org/documentation/latest/) or some other format that supports streaming. Also, as long as your shape(1024, 768, 3) array is the same type as user_id and other attributes - you can store a complete record as a flat 1D array and reshape back at read time. – Maksym Apr 16 '18 at 13:50
  • Also, to correct my earlier comment - the 16MB cap in Mongo on a single record in a collection, not the collection itself. – Maksym Apr 16 '18 at 13:51