
I have a very large dataset that I write to hdf5 in chunks via append, like so:

import os
import pickle

import pandas as pd
from tqdm import tqdm

with pd.HDFStore(self.train_store_path) as train_store:
    for filepath in tqdm(filepaths):
        with open(filepath, 'rb') as file:
            frame = pickle.load(file)

        # drop empty frames and delete their source files
        if frame.empty:
            os.remove(filepath)
            continue

        try:
            train_store.append(
                key='dataset', value=frame,
                min_itemsize=itemsize_dict)
            # delete the pickle only after it was appended successfully
            os.remove(filepath)
        except KeyError as e:
            print(e)
        except ValueError as e:
            print(frame)
            print(e)
        except Exception as e:
            print(e)

The data is far too large to load into one DataFrame, so I would like to try out vaex for further processing. There are a few things I don't really understand, though.

Since vaex uses a different representation in hdf5 than pandas/pytables, I'm wondering how to go about converting between the two formats. I tried loading the data in chunks into pandas, converting each chunk to a vaex DataFrame and then storing it, but there seems to be no way to append data to an existing vaex hdf5 file, at least none that I could find.

Is there really no way to create a large hdf5 dataset from within vaex? Is the only option to convert an existing dataset to vaex' representation (constructing the file via a python script or TOPCAT)?

Related to my previous question: if I work with a large dataset in vaex out-of-core, is it possible to then persist the results of any transformations I apply in vaex into the hdf5 file?

sobek

1 Answer


The problem with this storage format is that it is not column-based, which does not play well with datasets with a large number of rows: if you only work with one column, for instance, the OS will probably also read large portions of the other columns, and the CPU cache gets polluted with them. It would be better to store the data in a column-based format, such as vaex' hdf5 format or Arrow.
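To make the benefit concrete, a small sketch (assuming the data has already been exported to vaex' format as shown below; the file name and the column name x are hypothetical):

import vaex

df = vaex.open('batch_1.hdf5')  # memory-maps the file; nothing is read yet
df.x.mean()                     # touches only the pages of the 'x' column on disk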

Converting to a vaex dataframe can be done using:

import vaex
# copy_index=False avoids storing the pandas index as an extra column
vaex_df = vaex.from_pandas(pandas_df, copy_index=False)

You can do this for each dataframe, and store them on disk as hdf5 or arrow:

vaex_df.export('batch_1.hdf5')  # or 'batch_1.arrow'
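To tie this to the chunked pandas store from the question, a minimal sketch of the whole conversion loop (the store path, key, and chunk size are placeholders; chunked reading works because HDFStore.append creates a table-format store):

import pandas as pd
import vaex

# read the pandas/pytables store back in manageable chunks
chunks = pd.read_hdf('train_store.h5', key='dataset', chunksize=1_000_000)
for i, pandas_df in enumerate(chunks):
    vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
    vaex_df.export('batch_{}.hdf5'.format(i))  # one vaex file per chunk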

If you do this for many files, you can lazily (i.e. no memory copies will be made) concatenate them, or use the vaex.open function:

df1 = vaex.open('batch_1.hdf5')
df2 = vaex.open('batch_2.hdf5')
df = vaex.concat([df1, df2])  # will be seen as 1 dataframe, without a memory copy
df_alternative = vaex.open('batch*.hdf5')  # same effect, but only needs 1 line
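If you eventually want a single file, the concatenated dataframe can itself be exported; the export is done in chunks, so the full dataset does not have to fit in RAM (file names as in the sketch above):

import vaex

df = vaex.open('batch*.hdf5')  # lazy view over all batch files
df.export('combined.hdf5')     # streams to one contiguous file on disk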

Regarding your question about the transformations:

If you apply transformations to a dataframe, you can write out the computed values, or get the 'state', which includes the transformations:

import vaex
df = vaex.example()
df['difference'] = df.x - df.y
# df.export('materialized.hdf5', column_names=['difference'])  # do this if IO is fast and memory is abundant
# state = df.state_get()  # get the state in memory
df.state_write('mystate.json')  # or write it as json

Later, for instance in a new session, the materialized values or the saved state can be applied to a dataframe again:
import vaex
df = vaex.example()
# df.join(vaex.open('materialized.hdf5'))  # join on row number (super fast, 0 memory use!)
# df.state_set(state)  # or apply the state from memory
df.state_load('mystate.json')  # or from disk
df  # the 'difference' column is available again
Maarten Breddels
  • Thanks, that was quite helpful. I'm still not perfectly sure I understand the state thing, though. If you work with hdf5 and you want to persist results to disk (not just a pipeline but the actual results), can you only write the hdf5 in one go, so it all has to fit into RAM? The state is a pipeline of operations you performed on the data, right? Since I want to build a pipeline that preprocesses data for ML, and at some point I want to have the transformed data on disk again, I'm starting to think that vaex might not be the right tool for this exact application. – sobek Dec 11 '19 at 12:07
  • It will not have to fit in RAM, it will be exported in 'chunks', although the end result on disk is one contiguous array. I think what you want can be done with vaex, but this comment is too short to answer it in; maybe open a new question? – Maarten Breddels Dec 11 '19 at 13:05
  • Thanks, I guess I misread the comment about memory being abundant. I'll try some stuff and open a question once I face a concrete obstacle. – sobek Dec 11 '19 at 13:48
  • The pandas dataframe will have to fit in RAM, though. Also, my `vaex_df.export('batch_1.hdf5')` never completes in jupyterlab (even on a small file), and the bit that does get written out is unusable. vaex also can't read hdf5 written by pandas. Is there no way to add to a dataframe row by row in vaex, so that I can loop through JSON? – Superdooperhero Mar 17 '21 at 09:32