I have 185 data files containing 30 million rows in total. Each row has two columns: a single int that I want to use as an index, and a list of 512 ints.
So it looks something like this:
IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
The data is too large to load into memory, but I would like to retrieve, say, a couple hundred rows at a time using a list of indices.
I got advice from a comment on this question to use Parquet: "Most efficient way of saving a pandas dataframe or 2d numpy array into h5py, with each row a seperate key, using a column".
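If Parquet is the right tool here, I assume converting each file would look roughly like this (just a sketch using pyarrow; the file name and the row_group_size are my own guesses, since I don't know what actually matters for selective reads later):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # One of the 185 frames: an int "IndexID" column plus an "Ids" column
    # holding lists of 512 ints (shortened here for readability).
    df = pd.DataFrame({
        "IndexID": [1899317, 22861131, 2163410],
        "Ids": [[0, 47715, 1757], [0, 48156, 154], [0, 26039, 41156]],
    })

    table = pa.Table.from_pandas(df, preserve_index=False)
    # Smaller row groups, guessing that this lets a later lookup skip most
    # of the file instead of reading all of it.
    pq.write_table(table, "part-000.parquet", row_group_size=10_000)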
I've been looking at the official PyArrow Parquet guide (https://arrow.apache.org/docs/python/parquet.html) and the fastparquet guide (https://fastparquet.readthedocs.io/en/latest/api.html).
But I can't seem to find any way to retrieve a row using an index, or to tell whether the table stays on disk or gets loaded entirely into memory.
Is this possible? If so, how would I do something like this?
For example:
ParquetTable[22861131, 15760716]
[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1... [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
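The closest thing I can find in the PyArrow docs is the filters argument to read_table, so maybe what I want looks roughly like the sketch below, but I can't tell whether this reads only the relevant parts of the file or pulls everything into memory first:

    import pyarrow.parquet as pq

    wanted = [22861131, 15760716]

    # Is the filtering pushed down to the file, or does this load all
    # rows and filter them afterwards?
    table = pq.read_table(
        "part-000.parquet",
        filters=[("IndexID", "in", wanted)],
    )

    df = table.to_pandas().set_index("IndexID")
    print(df.loc[wanted, "Ids"])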