
This is a follow-up to this Stack Overflow question:

Column missing when trying to open hdf created by pandas in h5py

In that question I am trying to save a large amount of data to disk (too large to fit into memory), and retrieve specific rows of the data using indices.

One of the solutions given in the linked post is to create a separate key for every row.

At the moment, the only approach I can think of is iterating through each row and setting the keys directly.

For example, if this is my data:

IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131    [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716    [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098    [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...

I can go through my dataframe and set each row like this:

f.create_dataset(str(row['IndexID']), data=row['Ids'])
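
Spelled out, a minimal self-contained version of that per-row loop might look like the sketch below (the file name ids.h5 and the inline frame are hypothetical stand-ins for my real data):

import h5py
import pandas as pd

# hypothetical stand-in for the real dataframe above
df = pd.DataFrame({
    'IndexID': [1899317, 22861131, 2163410],
    'Ids': [[0, 47715, 1757], [0, 48156, 154], [0, 26039, 41156]],
})

with h5py.File('ids.h5', 'w') as f:
    for _, row in df.iterrows():
        # one dataset per row, keyed by the string form of IndexID
        f.create_dataset(str(row['IndexID']), data=row['Ids'])

# retrieval later: look up a single row by its IndexID
with h5py.File('ids.h5', 'r') as f:
    ids = f['1899317'][:]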

I am wondering if there is a batch way to do this.

SantoshGupta7
  • Saving in the Parquet columnar format makes reading and saving much easier. – Rajith Thennakoon May 10 '20 at 03:50
  • Is this something I can do from pandas? I see that it has to_parquet, but I don't see options to write and then read by an index. I am also looking at https://arrow.apache.org/docs/python/parquet.html, but I don't see anything about retrieval using an index. Is there a particular keyword for this operation in Parquet? (See the sketch after these comments.) – SantoshGupta7 May 10 '20 at 04:00
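
On the Parquet route raised in the comments: one candidate keyword is the filters argument, which pandas forwards to pyarrow so a read can be restricted to rows matching a predicate on a column. A minimal sketch, assuming a reasonably recent pandas/pyarrow and the same hypothetical frame as above:

import pandas as pd

# hypothetical example data, matching the frame in the question
df = pd.DataFrame({
    'IndexID': [1899317, 22861131, 2163410],
    'Ids': [[0, 47715, 1757], [0, 48156, 154], [0, 26039, 41156]],
})

df.to_parquet('ids.parquet', engine='pyarrow')

# read back only the rows whose IndexID is in the wanted set;
# `filters` is passed through to pyarrow, which prunes on read
wanted = [1899317, 2163410]
subset = pd.read_parquet('ids.parquet', engine='pyarrow',
                         filters=[('IndexID', 'in', wanted)])

How efficiently this prunes depends on how the row groups are laid out on disk; it is a predicate filter over the file, not an exact analogue of a keyed HDF5 lookup.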

0 Answers