Saving in a file an array or DataFrame together with other information

Question

The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using notes and/or characteristics.

This is a feature of great value to me as it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.

I am now trying to come up with a similar functionality in Python 3.6. So far, I have looked online and consulted a number of posts, which however do not exactly address what I want to do.

For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.

For example:

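import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}

np.savez('whatever_name.npz', a=a, d=d)  # the dict is stored as a pickled 0-d object array
# allow_pickle=True is needed on NumPy >= 1.16.3 to load object arrays
data = np.load('whatever_name.npz', allow_pickle=True)

arr = data['a']
dic = data['d'].item()  # unwrap the 0-d object array back into a dict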

However, the question remains:

Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?

I am particularly interested in hearing about the particular pros and cons of any suggestions you may have with examples. The fewer dependencies, the better.

jpp · Accepted Answer · 2019-11-08T11:00:20.413

8

There are many options. I will discuss only HDF5, because I have experience using this format.

Advantages: Portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.

Disadvantages: Reliance on single low-level C API, possibility of data corruption as a single file, deleting data does not reduce size automatically.

In my experience, for performance and portability, avoid PyTables / HDFStore for storing numeric data. You can instead use the intuitive interface provided by h5py.

Store an array

import h5py, numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance

dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                        compression='gzip', compression_opts=9)

Compression & chunking

There are many compression choices, e.g. blosc and lzf are good choices for compression and decompression performance respectively. Note gzip is native; other compression filters may not ship by default with your HDF5 installation.

Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.

Add some attributes

dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)

Store a dictionary

for k, v in d.items():
    f.create_dataset('dictgroup/'+str(k), data=v)

Out-of-memory access

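dictionary = f['dictgroup']  # a group handle; nothing is read from disk yet
res = dictionary['my_key']   # a dataset handle; index it (e.g. res[()]) to read the value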

There is no substitute for reading the h5py documentation, which exposes most of the C API, but you should see from the above there is a significant amount of flexibility.
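As a rough end-to-end sketch of the steps above, here is one way to write an array with attributes and read both back. The temporary path is an assumption made for illustration, and gzip is used at its default level rather than level 9 to keep the example quick.

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))
path = os.path.join(tempfile.mkdtemp(), 'example.h5')  # hypothetical location

# Write: the context manager closes the file even if an error occurs
with h5py.File(path, 'w') as f:
    dset = f.create_dataset('array', data=arr, chunks=(100, 100),
                            compression='gzip')
    dset.attrs['Description'] = 'Some text snippet'
    dset.attrs['RowIndexArray'] = np.arange(1000)

# Read: attributes come back with the dataset; only the slice is loaded
with h5py.File(path, 'r') as f:
    dset = f['array']
    description = dset.attrs['Description']
    row_index = dset.attrs['RowIndexArray']
    first_rows = dset[:5, :]
```

Opening the file in a with block is a design choice worth copying: an HDF5 file left open after a crash is one of the ways the single-file corruption mentioned above can happen.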

edited Nov 08 '19 at 11:00

answered Apr 23 '18 at 23:12


A few remarks: 1) blosc will be much faster for compressing data. 2) If many datasets are created and backwards compatibility isn't a big issue, f = h5py.File('name.hdf5', libver='latest') can be used. This should improve speed quite a bit (for small but numerous datasets). – max9111 Apr 25 '18 at 11:35
@max9111, (1) Is `blosc` native / portable outside of Python, e.g. you can view in HDFView / other libraries? (2) Thank you, I'll update with `libver` argument. – jpp Apr 25 '18 at 11:37

The Blosc filter can be installed system-wide. https://github.com/Blosc/hdf5-blosc I would mention it because it can be an order of magnitude faster than gzip, but you are right that it isn't included in a default HDF5 installation. – max9111 Apr 25 '18 at 11:50
@max9111 I am not familiar with blosc filter. Does it just apply some filtering, or is this just a compression library? In pandas.DataFrame.to_hdf() there is an option for "blosc" under complib (i.e. built in, no need for additional packages), which is just compression. Same thing? – tnknepp Apr 25 '18 at 12:13

Yes it is a package of compression and shuffle filters. For comparison you can try the following example with gzip and compare the performance. https://stackoverflow.com/questions/48672130/saving-to-hdf5-is-very-slow-python-freezing/48997927#48997927 (In this example I used pytables to register the blosc filters, but this should also work with globally installed blosc filters, mentioned in the comment above) – max9111 Apr 25 '18 at 12:37

score 1 · Answer 2 · answered Apr 24 '18 at 09:46

A practical way could be to embed metadata directly inside the NumPy array. The advantage is that, as you'd like, there's no extra dependency and it's very simple to use in the code. However, this doesn't fully answer your question, because you still need a mechanism to save the data, and I'd recommend using jpp's solution using HDF5.

To include metadata in an ndarray, there is an example in the documentation. You basically have to subclass an ndarray and add a field info or metadata or whatever.

It would give (code from the link above)

import numpy as np

class ArrayWithInfo(np.ndarray):

    def __new__(cls, input_array, info=None):
        # Input array is an already formed ndarray instance
        # We first cast to be our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info
        # Finally, we must return the newly created object:
        return obj

    def __array_finalize__(self, obj):
        # Also runs on view casting and slicing; obj is None only on explicit construction
        if obj is None: return
        self.info = getattr(obj, 'info', None)

To save the data through numpy, you'd need to overload the write function or use another solution.
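As a minimal sketch of "another solution", assuming the info is JSON-serializable, the raw data and the info can be written as two entries of one .npz file and recombined on load. The helper names save_with_info and load_with_info are hypothetical, not part of NumPy, and the class is repeated only so the example runs on its own.

```python
import json
import os
import tempfile

import numpy as np


class ArrayWithInfo(np.ndarray):
    """Same subclass as above, repeated so the sketch is self-contained."""

    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.info = getattr(obj, 'info', None)


def save_with_info(path, arr):
    # Hypothetical helper: store the plain data plus the info as a JSON string
    np.savez(path, data=np.asarray(arr), info=json.dumps(arr.info))


def load_with_info(path):
    # Hypothetical helper: rebuild the subclass from the two stored entries
    with np.load(path) as npz:
        return ArrayWithInfo(npz['data'], info=json.loads(npz['info'].item()))


path = os.path.join(tempfile.mkdtemp(), 'array_with_info.npz')
original = ArrayWithInfo(np.arange(6).reshape(2, 3), info={'note': 'raw counts'})
save_with_info(path, original)
restored = load_with_info(path)
```

Storing the info as a JSON string rather than as a Python object means the file loads without allow_pickle=True, at the cost of restricting the info to JSON-compatible types.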

score 0 · Answer 3 · answered Apr 21 '18 at 02:47

It's an interesting question, although very open-ended I think.

Text Snippets
For text snippets that have literal notes (as in, not code and not data), I really don't know what your use case is, but I don't see why I would deviate from using the usual with open() as f: ...

Small collections of various data pieces
Sure, your npz works. Actually what you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.

See here for a discussion of the differences between pickle and npz (but mainly, npz is optimized for numpy arrays).

Personally, I'd say that if you are not storing NumPy arrays I would use pickle, and even implement a quick MyNotes class that is basically a dictionary to save stuff in, with whatever additional functionality you may want.
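A quick sketch of what such a MyNotes class might look like, using only the standard library. The save and load method names and the example keys are assumptions made for illustration.

```python
import os
import pickle
import tempfile


class MyNotes(dict):
    """Hypothetical dict subclass that can save and load itself with pickle."""

    def save(self, path):
        # Pickle a plain dict so the file does not depend on this class
        with open(path, 'wb') as fh:
            pickle.dump(dict(self), fh)

    @classmethod
    def load(cls, path):
        with open(path, 'rb') as fh:
            return cls(pickle.load(fh))


path = os.path.join(tempfile.mkdtemp(), 'notes.pkl')
notes = MyNotes(todo=['check outliers'], method='OLS')
notes['source'] = 'survey wave 3'
notes.save(path)
restored = MyNotes.load(path)
```

Keep in mind that unpickling can execute arbitrary code, so this is only suitable for files you created yourself or otherwise trust.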

Collection of large objects
For really big np.arrays or dataframes I have previously used the HDF5 format. The good thing is that it is already built into pandas, so you can call df.to_hdf() directly. It does require PyTables underneath (installation should be fairly painless with pip or conda), but using PyTables directly can be a much bigger pain.

Again, this idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format utilizes space in a smarter way by leveraging repetition of similar values. When I was using it to store some ~2GB dataframes, it was able to reduce it by almost a full order of magnitude (~250MB).

One last player: feather
Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework, to persist data in a binary format that is language agnostic (and therefore readable from both R and Python). However, it is still under development, and last time I checked they did not recommend it for long-term storage (since the specification may change in future versions), only for exchanging data between R and Python.

They both just launched Ursa Labs, literally just weeks ago, which will continue growing this and similar initiatives.

score 0 · Answer 4 · answered Apr 24 '18 at 17:37

You stated as the reasons for this question:

... it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.

May I suggest a different paradigm than that offered by Stata? The notes and characteristics seem to be very limited and confined to just text. Instead, you should use Jupyter Notebook for your research and data analysis projects. It provides such a rich environment to document your workflow and capture details, thoughts and ideas as you are doing your analysis and research. It can easily be shared, and it's presentation-ready.

Here is a gallery of interesting Jupyter Notebooks across many industries and disciplines to showcase the many features and use cases of notebooks. It may expand your horizons beyond trying to devise a way to tag simple snippets of text to your data.

score 0 · Answer 5 · answered Apr 25 '18 at 12:08

I agree with jpp that HDF5 storage is a good option here. The difference between his solution and mine is that mine uses pandas DataFrames instead of NumPy arrays. I prefer the DataFrame since it allows mixed types, multi-level indexing (even datetime indexing, which is VERY important for my work), and column labeling, which helps me remember how different datasets are organized. Pandas also provides a slew of built-in functionality (much like NumPy). Another benefit of using pandas is that it has an HDF writer built in (i.e. pandas.DataFrame.to_hdf), which I find convenient.

When storing the DataFrame to h5 you have the option of storing a dictionary of metadata as well, which can be your notes to self, or actual metadata that does not need to be stored in the DataFrame (I use this for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}). In this regard, there is no difference between using a NumPy array and a DataFrame. For the full solution see my original question and solution here.
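Since the linked solution is not reproduced in this post, here is a hedged sketch of one common way to do this with pandas.HDFStore: put the DataFrame under a key, then attach the dictionary to that key's storer attributes. The key 'mydata', the attribute name metadata, and the column names are assumptions, and PyTables must be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({'height': np.arange(5.0),
                   'valid': [True, False, True, True, False]})
metadata = {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}

path = os.path.join(tempfile.mkdtemp(), 'with_meta.h5')  # hypothetical location

# Write the DataFrame, then hang the dict on the stored node's attributes
with pd.HDFStore(path, mode='w') as store:
    store.put('mydata', df)
    store.get_storer('mydata').attrs.metadata = metadata

# Read both back from the same file
with pd.HDFStore(path, mode='r') as store:
    df_back = store['mydata']
    meta_back = store.get_storer('mydata').attrs.metadata
```

The dictionary is pickled into the HDF5 attribute, so it is convenient from Python but will not be readable as a dictionary from other HDF5 tools.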

score 0 · Answer 6 · answered Apr 25 '18 at 18:44

jpp's answer is pretty comprehensive. I just wanted to mention that, as of pandas 0.22, Parquet is a very convenient and fast option with almost no drawbacks vs CSV (except perhaps the coffee break).

Reading and writing are handled by pandas.read_parquet and pandas.DataFrame.to_parquet.

At time of writing you'll need to also

pip install pyarrow

In terms of adding information, you have the schema metadata, which is attached to the data:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.normal(size=(1000, 10)))

tab = pa.Table.from_pandas(df)

tab = tab.replace_schema_metadata({'here' : 'it is'})

pq.write_table(tab, 'where_is_it.parq')

pq.read_table('where_is_it.parq')

which then yields a table

Pyarrow table
0: double
1: double
2: double
3: double
4: double
5: double
6: double
7: double
8: double
9: double
__index_level_0__: int64
metadata
--------
{b'here': b'it is'}

To get this back to pandas:

tab.to_pandas()
