I create an expandable EArray with N rows and 4 columns. Some columns require the float64 datatype, while the others could be stored as int32. Is it possible to vary the data type from column to column? Right now I use a single type (float64, below) for all of them, which wastes a lot of disk space on files that already exceed 10 GB.

For example, how can I ensure that the elements of columns 1-2 are int32 and those of columns 3-4 are float64?

import tables
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))  # one dtype for all 4 columns

Here is a simplified version of how I append to the EArray:

import numpy as np

s = 0
Matrix = np.ones(shape=(10**6, 4))

# counter, chunk2, left, right and length come from the surrounding read loop
if counter <= 10**6:  # keep filling Matrix until it holds 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right]  # chunk2 is the input np.ndarray
    s += length

# save to disk when rows = 10**6
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))
    s = 0  # reset the fill position for the new buffer

What are the cons of the following method?

import tables as tb
import numpy as np

filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))

# array containing ints..in reality it will be 10**6x2
arr1 = np.array([[1, 1],
                 [2, 2],
                 [3, 3]], dtype=np.int32)

# array containing floats..in reality it will be 10**6x2
arr2 = np.array([[1.1,1.2],
                 [1.1,1.2],
                 [1.1,1.2]], dtype=np.float64)

for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)

f.close()

print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
nuki

1 Answer

No and yes. All PyTables array types (Array, CArray, EArray, VLArray) hold a single, homogeneous datatype (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.

The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a single row; you don't define the shape or number of rows, which is implied as you add data to the table. If you already have your data in a NumPy recarray, you can pass it as the description= entry, and the Table will use its dtype as the table description and populate the table with the data. More info here: PyTables Tables Class
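
As a minimal sketch of that shortcut (the file name and the tiny example recarray here are just illustrative), creating and filling the table becomes a single call when the recarray already exists:

import numpy as np
import tables as tb

# a small recarray with the mixed dtypes from the question (int32 + float64)
rec = np.rec.fromarrays(
    [np.arange(5, dtype=np.int32), np.arange(5, dtype=np.int32),
     np.random.rand(5), np.random.rand(5)],
    names=['int1', 'int2', 'float1', 'float2'])

with tb.open_file('table_from_recarray.h5', 'w') as h5f:
    # the table description (dtype) AND the data both come from the recarray
    tbl = h5f.create_table('/', 'dataset_1', description=rec)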

Your code would look something like this:

import tables as tb
import numpy as np

# int32 for the first two columns, float64 for the last two
table_dt = np.dtype(
           {'names': ['int1', 'int2', 'float1', 'float2'],
            'formats': [np.int32, np.int32, np.float64, np.float64]})

# Create some random data:
i1 = np.random.randint(0, 1000, (10**6,))
i2 = np.random.randint(0, 1000, (10**6,))
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)

with tb.open_file('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Method 1: create an empty recarray 'Matrix', then add data field by field:
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2: create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table (note: running both methods appends 2*10**6 rows)
    a.append(Matrix)
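
To fold this into the chunked workflow from your question, here is a sketch (produce_chunks(), the buffer size, and the file name are illustrative placeholders, not part of your original code): fill a recarray buffer and append it to the table whenever it is full.

import numpy as np
import tables as tb

table_dt = np.dtype({'names': ['int1', 'int2', 'float1', 'float2'],
                     'formats': [np.int32, np.int32, np.float64, np.float64]})

def produce_chunks():
    # Stand-in for the real data source: yields a few (1000, 4) arrays
    # (ints in columns 0-1, floats in columns 2-3).
    for _ in range(5):
        c = np.random.rand(1000, 4)
        c[:, :2] = np.random.randint(0, 1000, (1000, 2))
        yield c

with tb.open_file('table.h5', 'w') as h5f:
    tbl = h5f.create_table('/', 'dataset_1', description=table_dt)
    buf = np.recarray((10**6,), dtype=table_dt)  # reusable in-memory buffer
    s = 0                                        # rows currently held in buf

    for chunk in produce_chunks():
        n = len(chunk)                           # assumes n fits in the remaining buffer space
        buf['int1'][s:s+n] = chunk[:, 0]
        buf['int2'][s:s+n] = chunk[:, 1]
        buf['float1'][s:s+n] = chunk[:, 2]
        buf['float2'][s:s+n] = chunk[:, 3]
        s += n
        if s == 10**6:                           # buffer full: flush to disk and reuse it
            tbl.append(buf)
            s = 0

    if s:                                        # flush any leftover rows
        tbl.append(buf[:s])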

You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in another thread.

The .create_table() method has an optional parameter: expectedrows=. This parameter is used 'to optimize the HDF5 B-Tree and amount of memory used'. The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
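
For example (a sketch reusing h5f and table_dt from the code above; 10**8 is just a stand-in for your actual row estimate):

# Telling PyTables roughly how many rows to expect lets it size the HDF5
# B-Tree and chunking more sensibly than the 10000-row default.
a = h5f.create_table('/', 'dataset_1', description=table_dt,
                     expectedrows=10**8)  # estimate of the final row count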

Also, you should consider file compression. There's a trade-off: compression reduces the file size, but will reduce I/O performance (increases access time). There are a few options:

  1. Enable compression when you create the file (add the filters= parameter to open_file() or create_table()). Start with tb.Filters(complevel=1); see the sketch after this list.
  2. Use the HDF Group utility h5repack - run it against an existing HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa).
  3. Use the PyTables utility ptrepack - it works similarly to h5repack and is delivered with PyTables.
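
A minimal sketch of option 1 (zlib at complevel=1 is just a reasonable starting point; tune the compressor and level for your data):

import numpy as np
import tables as tb

table_dt = np.dtype({'names': ['int1', 'int2', 'float1', 'float2'],
                     'formats': [np.int32, np.int32, np.float64, np.float64]})

# Passing filters= to open_file makes compression the default for every node
# created in this file; it can also be passed per-node to create_table().
filters = tb.Filters(complevel=1, complib='zlib')
with tb.open_file('table_compressed.h5', 'w', filters=filters) as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)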

For files I work with often, I tend to keep them uncompressed for the best I/O performance. When I'm done, I convert them to a compressed format for long-term archiving.

kcw78
  • So, now I have run into another problem. I was actually appending the whole numpy array whenever it reaches a certain number of rows (10^6) to minimize the number of writes (increasing computational efficiency), as recommended in the SO post I referenced above. Since I can't have a numpy array of mixed data types, maybe I can have separate NumPy vectors with different data types. But how can I append the complete vectors to the table at once? I've included a snippet of my code above. – nuki Aug 21 '20 at 18:54
  • You **can** create a numpy array of mixed data types (called a recarray or record array -- NOT the same as an ndarray). Using a recarray is the easiest way to append 10**6 rows at a time. I always add data to tables by row, never by column. I can modify my answer to show how to create a recarray and add the data. There might be a way to add by column with the PyTables `Cols` methods. I will have to investigate that. – kcw78 Aug 21 '20 at 19:20
  • Oh, I wasn't aware of that, thanks! I would appreciate it if you could show how to append several rows at once using a recarray (like 10^6 rows). In that case, I will have 10^6 x 4 elements appended each time... the 4 columns have different `dtypes` – nuki Aug 21 '20 at 19:33
  • There are many, many ways to do this. I added the 2 methods I use most often. Notice how `.create_table()` and the recarray creation both use the same `dtype=`?!? There's a method to the madness. ;) – kcw78 Aug 21 '20 at 19:59
  • That is so cool! Thanks for introducing me to recarray; I am accepting this as the answer! In the meantime, I also worked on something. Please see my edit above: are there any cons to creating separate earrays? For example, will it be more memory intensive than your method for large appends, like billions of rows appended a million at a time? Asking out of curiosity, driven by madness! :-) – nuki Aug 21 '20 at 22:26
  • I don't know if there are any performance differences between Tables and EArrays. I suspect not. Better to ask Francesc on the PyTables forum. Personally, now that I've learned how to use recarrays with Tables, I prefer them. The big con (IMHO) is keeping track of data in 2 datasets. Not hard, it just requires care. Other reasons: 1) field/column labels can be used to slice (and also describe your data), 2) Tables have search/sort/index functionality not available with the array class objects. I only use arrays when my datasets make sense as NumPy ndarrays. – kcw78 Aug 22 '20 at 01:05
  • @nuki, I updated my answer to add some info about `expectedrows` parameter and compression filters that may improve performance/size for very large files. – kcw78 Aug 22 '20 at 21:02