
I have an HDF5 file with three datasets: one containing names and the other two containing associated values. The datasets are large, with almost 100,000,000 elements each. I'd like to print the top 300 name-value pairs to a file in tab-delimited format; however, I'm having an issue implementing my solution.

I want to combine the three datasets into a single NumPy array so that I can rank by the values in the second dataset and pull out the top 300 entries. However, my program doesn't appear to be able to construct the combined array, at least not in a reasonable runtime. My code is shown below.

#!/usr/bin/env python3

# Importing modules.
import h5py
import numpy as np

# Creating path for HDF5 file.
HDF5_PATH = '/path/to/hdf5_file.hdf5'

# Creating path for outfile.
OUTFILE_PATH = '/path/to/outfile.tsv'

# Loading HDF5 file.
hdf5_file = h5py.File(HDF5_PATH, 'r')

# Getting 3D array of datasets.
print('Building array')
hdf5_arr = np.array([hdf5_file['col_1'], hdf5_file['col_2'], hdf5_file['col_3']])

# Getting top 300 rows by second column.
print('Getting top 300 values')
# Indices of the 300 largest values in the second dataset (col_2).
top_300_idx = np.argpartition(hdf5_arr[1], -300)[-300:]
top_300_arr = hdf5_arr[:, top_300_idx].T

# Printing top 300 rows.
print('Printing top 300 values')
with open(OUTFILE_PATH, 'a') as outfile:
    np.savetxt(outfile, top_300_arr, delimiter="\t", fmt='%s')

I've added print statements to monitor progress, and currently my code prints out `Building array` and doesn't appear to progress for at least an hour. That would mean that my issue is with the line `hdf5_arr = np.array([hdf5_file['col_1'], hdf5_file['col_2'], hdf5_file['col_3']])`. Is there any way I can improve my code so that it runs in a suitable amount of time?
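One alternative I've been considering, but haven't tested at this scale, is to read only the value column into memory, find the indices of the 300 largest values, and then pull just those 300 entries from each dataset rather than stacking everything first. A rough sketch (I'm assuming the datasets are one-dimensional and that col_2 holds the numeric values I want to rank by):

# Importing modules.
import h5py
import numpy as np

HDF5_PATH = '/path/to/hdf5_file.hdf5'
OUTFILE_PATH = '/path/to/outfile.tsv'

with h5py.File(HDF5_PATH, 'r') as hdf5_file:
    # One contiguous read of the value column into memory.
    values = hdf5_file['col_2'][:]

    # Indices of the 300 largest values (not sorted by value), put in
    # increasing order because h5py wants fancy-indexing selections sorted.
    top_idx = np.sort(np.argpartition(values, -300)[-300:])

    # Read only those 300 entries from each dataset.
    names = hdf5_file['col_1'][top_idx]
    vals_2 = values[top_idx]
    vals_3 = hdf5_file['col_3'][top_idx]

# Writing the 300 rows in tab-delimited format.
with open(OUTFILE_PATH, 'w') as outfile:
    np.savetxt(outfile, np.column_stack((names, vals_2, vals_3)), delimiter='\t', fmt='%s')

Would that be a sensible approach, or is there a better way to get the combined array itself built in a reasonable time?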

J0HN_TIT0R
  • What's the time for `x=hdf5_file['col_1'][...]`? That is, simply loading one dataset. – hpaulj Mar 14 '18 at 21:41
  • According to `timeit.default_timer()`, only 1.6437843441963196e-06 seconds. That must mean there's something wrong with how I'm building my array, right? – J0HN_TIT0R Mar 14 '18 at 22:05
  • But if I do `x = np.array(hdf5_file['col_1'])`, it's 20.615581781603396 seconds, and `x = np.array([hdf5_file['col_1'], hdf5_file['col_1']])` doesn't seem like it will complete in a suitable runtime. – J0HN_TIT0R Mar 14 '18 at 22:23
  • I'd first load the 3 arrays. What's the shape and dtype of each? `np.array` loads them and then concatenates on a new initial axis. You might want to delay that until after selecting the top 300. – hpaulj Mar 14 '18 at 23:43
  • How about RAM usage? (Are you running out of physical RAM?) What are the shape and chunk shape of your datasets? An insufficient chunk cache can even lead to slow sequential reading, for example: https://stackoverflow.com/a/48446301/4045774 – max9111 Mar 15 '18 at 08:50
  • Provided it all fits easily in memory, `hdf5_arr = np.array([hdf5_file['col_1'][:], hdf5_file['col_2'][:], hdf5_file['col_3'][:]])` will load the data as NumPy arrays directly and *then* process them. In some situations, indexing h5py datasets directly is slow and it is better to load the data into RAM (a short sketch of this is below these comments). – Pierre de Buyl Mar 15 '18 at 12:15
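Following up on the last two comments, a minimal sketch of the `[:]`-loading approach, assuming the three datasets are one-dimensional and all fit comfortably in RAM:

# Read each dataset fully into memory first; later element access then hits
# plain NumPy arrays rather than the h5py dataset objects.
col_1 = hdf5_file['col_1'][:]
col_2 = hdf5_file['col_2'][:]
col_3 = hdf5_file['col_3'][:]

# Select the top 300 entries by the second column before stacking anything.
top_idx = np.argpartition(col_2, -300)[-300:]
top_300_arr = np.array([col_1[top_idx], col_2[top_idx], col_3[top_idx]]).T

This keeps the expensive step to three sequential reads and only ever stacks 300 entries per dataset.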

0 Answers