I have an 80 GB HDF5 file, and I want to read, say, a random set of 1000 columns, assuming I do not know the column names. How can I achieve this?
- `df.sample(n=1000, axis=1)`? – Anurag Dabas Jul 13 '21 at 03:17
- @AnuragDabas That would require loading the entire dataframe first. – justanewb Jul 13 '21 at 03:19
- I think this can help, so have a look at [read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame](https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame/22259008#22259008) – Anurag Dabas Jul 13 '21 at 03:26
1 Answer
You should first know the number of columns in your file; let's assume 10,000 here, with integer column labels. You can then combine `numpy.random.choice` with the `columns` option of `pandas.read_hdf` (note that column selection on read only works if the file was written in `table` format, not the default fixed format):

    import numpy as np
    import pandas as pd

    cols = sorted(np.random.choice(10000, size=1000, replace=False))
    df = pd.read_hdf('file', columns=cols)
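Since the question says the column names are unknown, a small self-contained sketch of the full workflow may help: write a tiny table-format file (the file name `demo.h5`, the key `data`, and the 4×5 frame are placeholders standing in for the real 80 GB file), read back only the column labels with `stop=0` so no rows are loaded, then pull a random subset of columns.

```python
import numpy as np
import pandas as pd

# Placeholder file standing in for the big one; format='table' is
# required for the `columns` argument of read_hdf to work.
demo = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("abcde"))
demo.to_hdf("demo.h5", key="data", format="table")

# Discover the column labels without loading any rows.
all_cols = pd.read_hdf("demo.h5", key="data", stop=0).columns

# Draw a random subset of labels, then read only those columns.
rng = np.random.default_rng(0)
cols = list(rng.choice(all_cols, size=2, replace=False))
sample = pd.read_hdf("demo.h5", key="data", columns=cols)
print(sample.shape)  # (4, 2): all rows, only the sampled columns
```

For the real file you would keep `stop=0` exactly as above, since it avoids materializing any of the 80 GB before the column subset is chosen.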

mozway