I'm trying to create a PyTables table to store a 200000 x 200000 matrix. I tried this code:

import tables
columns = {}
for x in range (200000):
    columns['col' + str(x)] = tables.FloatCol()
h5f = tables.open_file('matrix1.h5', 'w')
tbl = h5f.create_table('/', 'matrix', columns)
h5f.close()

but it fails with this traceback:

  File "/home/nick/tests0/reg/create_tables.py", line 18, in <module>
    tbl = h5f.create_table('/', 'matrix', columns)

  File "/home/nick/anaconda3/lib/python3.8/site-packages/tables/file.py", line 1053, in create_table
    ptobj = Table(parentnode, name,

  File "/home/nick/anaconda3/lib/python3.8/site-packages/tables/table.py", line 835, in __init__
    super(Table, self).__init__(parentnode, name, new, filters,

  File "/home/nick/anaconda3/lib/python3.8/site-packages/tables/leaf.py", line 286, in __init__
    super(Leaf, self).__init__(parentnode, name, _log)

  File "/home/nick/anaconda3/lib/python3.8/site-packages/tables/node.py", line 264, in __init__
    self._v_objectid = self._g_create()

  File "/home/nick/anaconda3/lib/python3.8/site-packages/tables/table.py", line 1022, in _g_create
    self._v_objectid = self._create_table(

  File "tables/tableextension.pyx", line 211, in tables.tableextension.Table._create_table

HDF5ExtError: Problems creating the table

What am I doing wrong here?

Nick L

1 Answer

That's a big matrix (about 300GB if all values are 8-byte ints or floats). You will likely have to write it incrementally. (I don't have enough RAM on my system to do it all at once.)
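The size estimate can be checked with a little arithmetic. This is a minimal sketch that only counts raw, uncompressed bytes; the helper name `matrix_bytes` is made up for illustration, and the 2000-row block size is the one used in the example further down.

```python
def matrix_bytes(nrows, ncols, itemsize):
    """Raw (uncompressed) size in bytes of a dense nrows x ncols matrix."""
    return nrows * ncols * itemsize

n = 200_000
size_8 = matrix_bytes(n, n, 8)       # 8-byte values (int64 / float64)
size_4 = matrix_bytes(n, n, 4)       # 4-byte values (int32 / float32)
block_8 = matrix_bytes(2000, n, 8)   # one 2000-row block of 8-byte values

print(f"full matrix, 8-byte items: {size_8 / 1e9:.0f} GB ({size_8 / 2**30:.0f} GiB)")
print(f"full matrix, 4-byte items: {size_4 / 1e9:.0f} GB ({size_4 / 2**30:.0f} GiB)")
print(f"one 2000-row block, 8-byte items: {block_8 / 1e9:.1f} GB")
```

A single 2000-row block fits comfortably in memory on most machines, which is why writing block by block is practical even though the full matrix is not.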

Without seeing your data types, it's hard to give specific advice.
First question: do you really want to create a Table or will an Array suffice? PyTables has both types. What's the difference?
An Array holds homogeneous data (like a NumPy ndarray) and can have any number of dimensions. A Table is typically used to hold heterogeneous data (like a NumPy recarray) and is always 2d (really a 1d array of structured types). Tables also support complex queries with the PyTables API.

The key when creating a Table is to use either the description= or obj= parameter to describe the structured type (field names and data types) of each row. I recently posted an answer that shows how to create a Table; please review it. You may find that you don't want to define 200000 fields/columns for the Table. See this answer: "different data types for different columns of an array".
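As a rough illustration of the description= approach, here is a minimal sketch that defines a Table from a NumPy structured dtype and runs a simple query on it. The field names (`id`, `label`, `value`), the node name `records`, and the temporary file path are assumptions made up for this example; they are not taken from the question.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical row layout: three fields with different data types.
row_dtype = np.dtype([("id", "i8"), ("label", "S8"), ("value", "f8")])

# Write to a throwaway file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")

with tb.open_file(path, "w") as h5f:
    # description= accepts a NumPy dtype describing one row of the Table.
    tbl = h5f.create_table("/", "records", description=row_dtype)

    rows = np.array([(1, b"a", 0.5), (2, b"b", 1.5), (3, b"c", 2.5)],
                    dtype=row_dtype)
    tbl.append(rows)
    tbl.flush()

    # In-kernel query: iterate only over rows that match the condition.
    selected = [int(r["id"]) for r in tbl.where("value > 1.0")]
    table_rows = tbl.nrows
    table_columns = list(tbl.colnames)

print(table_rows, table_columns, selected)
```

A Table like this makes sense when each row really has a handful of differently typed fields. A matrix of 200000 identical numeric columns does not fit that pattern.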

If you just want to save a matrix of 200000x200000 homogeneous entities, an array is easier. (Given the data size, you probably need to use an EArray, so you can write the data in increments.) I wrote a simple example that creates an EArray with 2000x200000 entities, then adds 3 more sets of data (each 2000 rows; total of 8000 rows).

  • The shape=(0,ncols) parameter indicates the first axis can be extended, and creates ncols columns.
  • The expectedrows=nrows parameter is important in large datasets to improve I/O performance.

The resulting HDF5 file is 6GB. To get all 200000 rows, call earr.append(arr) 99 times in total (the first 2000 rows are written when the EArray is created). Code below:

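import tables as tb
import numpy as np

nrows = 200000
ncols = 200000
# One block of 2000 rows; the same block is reused for every write.
arr = np.arange(2000 * ncols).reshape(2000, ncols)

# The context manager closes the file even if a write fails.
with tb.open_file('matrix1.h5', 'w') as h5f:
    # shape=(0, ncols) makes the first axis extendable;
    # obj=arr sets the data type and writes the first 2000 rows.
    earr = h5f.create_earray('/', 'myarray', shape=(0, ncols),
                             expectedrows=nrows, obj=arr)
    # Three more blocks of 2000 rows each, for 8000 rows in total.
    earr.append(arr)
    earr.append(arr)
    earr.append(arr)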
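The repeated append calls can also be written as a loop. The sketch below is a scaled-down variant (1000 x 500 instead of 200000 x 200000) so it runs quickly. It assumes an empty EArray created with atom= rather than obj=, uses a made-up fill pattern as a stand-in for real data, and writes to a temporary file. It then reopens the file and reads back a few rows by slicing.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Scaled-down sizes (assumptions for this demo only).
nrows, ncols, block = 1000, 500, 100

path = os.path.join(tempfile.mkdtemp(), "matrix_demo.h5")

with tb.open_file(path, "w") as h5f:
    # With atom= and no obj=, the EArray starts with zero rows.
    earr = h5f.create_earray("/", "myarray", atom=tb.Float64Atom(),
                             shape=(0, ncols), expectedrows=nrows)
    for start in range(0, nrows, block):
        # Stand-in data: every element of a row holds that row's index.
        row_ids = np.arange(start, start + block, dtype=np.float64)
        earr.append(np.repeat(row_ids[:, None], ncols, axis=1))

with tb.open_file(path, "r") as h5f:
    earr = h5f.root.myarray
    final_shape = tuple(earr.shape)
    # Slicing returns just the requested rows as a NumPy array.
    chunk = earr[250:255, :]

print(final_shape, chunk[:, 0])
```

Only one block is held in memory at a time, so the same pattern should scale to the full matrix as long as each block fits in RAM and the disk has room for the file.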
kcw78