
I have an 80 GB HDF5 (`.h5`) file, and I want to read, say, a random set of 1000 columns, assuming I do not know the column names. How would we achieve this?

justanewb
  • `df.sample(n=1000,axis=1)`? – Anurag Dabas Jul 13 '21 at 03:17
  • @AnuragDabas That would require loading the entire dataframe first. – justanewb Jul 13 '21 at 03:19
  • I think this can help so have a look at [read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame](https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame/22259008#22259008) – Anurag Dabas Jul 13 '21 at 03:26

1 Answer


You first need to know the number of columns in your file. Let's assume 10000 here.

You can then use a combination of numpy.random and the columns option of pandas.read_hdf:

```python
import numpy as np
import pandas as pd

# Draw 1000 distinct column indices at random, then read only those columns.
df = pd.read_hdf('file', columns=sorted(np.random.choice(range(10000), size=1000, replace=False)))
```

Note that the `columns` argument only works if the file was written in table format (`format='table'`); the fixed format does not support reading a subset of columns. This also assumes the columns carry the default integer labels, since `columns` matches column labels, not positions.
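The sampling step on its own can be sketched with NumPy alone (the total of 10000 columns is the assumed count from above; the resulting list is what gets passed to `columns`):

```python
import numpy as np

n_total = 10_000   # assumed total number of columns in the file
n_sample = 1_000   # how many columns to read

# A seedable Generator makes the sample reproducible if needed.
rng = np.random.default_rng()

# Draw distinct column indices; sorting keeps the HDF5 read sequential,
# which is generally friendlier to on-disk access.
cols = sorted(int(c) for c in rng.choice(n_total, size=n_sample, replace=False))
```

Because `replace=False` guarantees distinct indices, `cols` always contains exactly 1000 unique values in the range `[0, 10000)`.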
mozway