I have a very large dataset that I write to HDF5 in chunks via append, like so:
import os
import pickle

import pandas as pd
from tqdm import tqdm

with pd.HDFStore(self.train_store_path) as train_store:
    for filepath in tqdm(filepaths):
        with open(filepath, 'rb') as file:
            frame = pickle.load(file)
        # skip and delete empty chunks
        if frame.empty:
            os.remove(filepath)
            continue
        try:
            # append this chunk to the table-format dataset
            train_store.append(
                key='dataset', value=frame,
                min_itemsize=itemsize_dict)
            os.remove(filepath)
        except KeyError as e:
            print(e)
        except ValueError as e:
            print(frame)
            print(e)
        except Exception as e:
            print(e)
The data is far too large to load into one DataFrame, so I would like to try out vaex for further processing. There are a few things I don't really understand, though.
Since vaex uses a different representation in HDF5 than pandas/PyTables (VOTable), I'm wondering how to convert between those two formats. I tried loading the data in chunks into pandas, converting each chunk to a vaex DataFrame and then storing it, but there seems to be no way to append data to an existing vaex HDF5 file, at least none that I could find.
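Roughly, my attempt looked like the sketch below (not my exact code; the chunk size, file names and the 'dataset' key are placeholders, and I'm assuming vaex.from_pandas, vaex.open_many and export_hdf5 behave the way I read the docs). It writes each pandas chunk to its own vaex-style HDF5 file and then combines them at the end, since I found no append:

import glob

import pandas as pd
import vaex

# read the pandas/PyTables store in chunks and export each chunk
# to its own vaex-style HDF5 file
chunks = pd.read_hdf(self.train_store_path, key='dataset', chunksize=1_000_000)
for i, chunk in enumerate(chunks):
    vaex.from_pandas(chunk, copy_index=False).export_hdf5(f'chunk_{i:05d}.hdf5')

# open all chunk files as one virtual DataFrame and write a single file
df = vaex.open_many(sorted(glob.glob('chunk_*.hdf5')))
df.export_hdf5('train_vaex.hdf5')

This works, but it needs a full extra copy of the data on disk and doesn't feel like the intended workflow.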
Is there really no way to create a large HDF5 dataset from within vaex? Is the only option to convert an existing dataset to vaex's representation (constructing the file via a Python script or TOPCAT)?
Related to my previous question: if I work with a large dataset in vaex out-of-core, is it possible to then persist the results of any transformations I apply in vaex into the HDF5 file?
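To make that concrete, here is a small sketch of what I mean (the column x and the file names are just placeholders for my data):

import vaex

df = vaex.open('train_vaex.hdf5')                      # memory-mapped, out-of-core
df['x_scaled'] = (df.x - df.x.mean()) / df.x.std()     # virtual column, evaluated lazily
# can the new column be written back so it ends up in an HDF5 file,
# or is exporting to a new file the only way?
df.export_hdf5('train_vaex_transformed.hdf5')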