
The topic of 'how to store numpy arrays in sqlite' has been discussed a couple of times. I think there are mainly two solutions: storing the raw array bytes in a BLOB, or serializing the whole array (numpy header included) into a BLOB.
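For illustration, a minimal sketch of storing the raw array bytes in a BLOB (the table and column names here are just made up):

```python
import sqlite3
import numpy as np

con = sqlite3.connect("arrays.db")
con.execute("CREATE TABLE IF NOT EXISTS arrays (id INTEGER PRIMARY KEY, data BLOB)")

# store the raw buffer of the array (no numpy header) as a BLOB
arr = (np.arange(1_000_000) % 256).astype(np.uint8)
con.execute("INSERT OR REPLACE INTO arrays (id, data) VALUES (?, ?)", (1, arr.tobytes()))
con.commit()
```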

Now let us suppose the numpy array is huge and only a slice of it is needed from the database. What is the best approach to select and retrieve the sliced data?

SELECT substr(data,start,length)

This works as long as data is a BLOB and the array was stored as raw np.int8/np.uint8 bytes (i.e., without the numpy header), so that one element corresponds to exactly one byte. What about other data types such as np.uint64?
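For the 8-bit case, the slicing might look like this (a sketch reusing the hypothetical table from above; note that SQLite's substr() is 1-indexed):

```python
# slice elements [100:150] of the uint8 array stored under id=1
start, length = 100, 50
row = con.execute(
    "SELECT substr(data, ?, ?) FROM arrays WHERE id = ?",
    (start + 1, length, 1),      # +1 because substr() is 1-indexed
).fetchone()
chunk = np.frombuffer(row[0], dtype=np.uint8)
```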

Of course, it is possible to store the numpy dtype in the sqlite database, too. Then any sliced-data request would need to adapt the SELECT statement and the start/length information accordingly, i.e., scale them by the number of bytes (itemsize) of the respective data type.
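A sketch of that scaling for np.uint64, assuming the row with id=1 holds the raw bytes of a uint64 array:

```python
dtype = np.dtype(np.uint64)
start_el, n_el = 100, 50                     # desired slice in elements
byte_start = start_el * dtype.itemsize + 1   # substr() counts bytes and is 1-indexed
byte_len = n_el * dtype.itemsize
row = con.execute(
    "SELECT substr(data, ?, ?) FROM arrays WHERE id = ?",
    (byte_start, byte_len, 1),
).fetchone()
chunk = np.frombuffer(row[0], dtype=dtype)
```

For multi-dimensional arrays the same idea applies, but the shape (and, for non-contiguous slices, the strides) would have to be stored alongside the data as well.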

Is there a better way to do this?

user26372
  • For large arrays, it is not a good idea to embed the Numpy array in the SQLite database. This makes the DB requests far less efficient (most SQL databases are not optimized for that) and harder to migrate. I think the best option is to store a reference to an external file (possibly a unique one depending on your needs). This also improves/fixes the performance of slicing. – Jérôme Richard Jun 27 '22 at 23:17
  • While this can certainly be done, it is probably a bit cumbersome as you need strides information as well as the datatype to extract sliced data, unless you have 1D data. Also, you can certainly peek here (even if it is C++, the idea is the same): https://stackoverflow.com/questions/3005231/how-to-store-array-in-one-column-in-sqlite3 – norok2 Jun 28 '22 at 08:14
  • Would "huge" mean that it would not fit into memory, or just some big array you do not really want to load entirely? – norok2 Jun 28 '22 at 14:43
  • Each entry could be 1M to 10M data points (int8 or int16) ... and sometimes I'm only interested in, e.g., 50,000 data points. Of course, I could use HDF5, but these files get corrupted easily when creating them, are not as easy to use in terms of parallel threads, and I think are somewhat more limited when it comes to multi-dimensional data and searching (since no SQL-like querying exists). – user26372 Jun 28 '22 at 17:26
  • What about using [parallel HDF5](https://docs.h5py.org/en/stable/mpi.html) then? HDF5 seems clearly better suited for the mentioned usage. With a reference to an HDF5 entry you can benefit from the SQL request and fast slicing (without making your database slower -- considering SQLite is known to be already pretty slow). Note that if the extracted points are not stored contiguously, then it should be faster to read the whole array (especially on HDD storage). – Jérôme Richard Jun 28 '22 at 19:52
  • Because HDF5 files easily get corrupted ... it is a well-known problem. It sucks. Forgot to close a file handle? Script crashes? Oh, never mind the 1 TB of data you recorded beforehand, because the whole file will be corrupted unless precautions have been taken (flushing the output after each write and such things). – user26372 Jun 28 '22 at 23:38
  • Have you considered [zarr](https://zarr.readthedocs.io/en/stable/index.html)? I have not used it much myself yet but it supports a [sqlite store](https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.SQLiteStore) among several other options. This [numpy issue](https://github.com/numpy/numpy/issues/15760) links to several good resources and has discussion from numpy maintainers. – webelo Jul 01 '22 at 19:13
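Following up on the zarr suggestion in the last comment, a minimal, untested sketch assuming zarr 2.x (where zarr.storage.SQLiteStore is available); the file name, array shape, chunk size and dtype are made up:

```python
import numpy as np
import zarr

# chunked array stored inside a single SQLite file; a slice only touches the
# chunks it overlaps, so reading 50,000 points does not load the whole array
store = zarr.storage.SQLiteStore("example.sqlite")
z = zarr.zeros((10_000_000,), chunks=(100_000,), dtype="i2", store=store, overwrite=True)
z[:1_000_000] = np.arange(1_000_000, dtype=np.int16)
part = z[500_000:550_000]
store.close()
```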

0 Answers