
I have an 80 GB HDF5 (`.h5`) file, and I want to read, say, a random set of 1000 columns, assuming I do not know the column names. How would we achieve this?

justanewb
  • `df.sample(n=1000,axis=1)`? – Anurag Dabas Jul 13 '21 at 03:17
  • @AnuragDabas That would require loading the entire dataframe first. – justanewb Jul 13 '21 at 03:19
  • I think this can help so have a look at [read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame](https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame/22259008#22259008) – Anurag Dabas Jul 13 '21 at 03:26

1 Answer


You first need to know the number of columns in your file. Let's assume 10000 here.

You can then use a combination of numpy.random and the columns option of pandas.read_hdf:

```python
import numpy as np
import pandas as pd

# Draw 1000 distinct column indices at random, then read only those columns.
df = pd.read_hdf('file', columns=sorted(np.random.choice(range(10000), size=1000, replace=False)))
```

Note that the `columns` argument only works if the file was written in table format (`format='table'`); the fixed format does not support reading a subset of columns. This also assumes the columns carry the default integer labels, since `columns` matches column labels, not positions.
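The sampling step on its own can be sketched with NumPy alone (the total of 10000 columns is the assumed count from above; the resulting list is what gets passed to `columns`):

```python
import numpy as np

n_total = 10_000   # assumed total number of columns in the file
n_sample = 1_000   # how many columns to read

# A seedable Generator makes the sample reproducible if needed.
rng = np.random.default_rng()

# Draw distinct column indices; sorting keeps the HDF5 read sequential,
# which is generally friendlier to on-disk access.
cols = sorted(int(c) for c in rng.choice(n_total, size=n_sample, replace=False))
```

Because `replace=False` guarantees distinct indices, `cols` always contains exactly 1000 unique values in the range `[0, 10000)`.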
mozway