I have an HDF5 file with three datasets: one containing names and the other two containing associated values. The datasets are large, with almost 100,000,000 elements each. I'd like to print the top 300 name-value pairs to a file in tab-delimited format, but I'm having trouble implementing my solution.
I want to combine the three datasets into a single three-column NumPy array, so that I can sort by the second column and pull out the top 300 rows. However, my program doesn't appear to be able to construct the array, at least not in a reasonable runtime. My code is shown below.
#!/usr/bin/env python3
# Importing modules.
import h5py
import numpy as np
# Creating path for HDF5 file.
HDF5_PATH = ('/path/to/hdf5_file.hdf5')
# Creating path for outfile.
OUTFILE_PATH = ('/path/to/outfile.tsv')
# Loading HDF5 file.
hdf5_file = h5py.File(HDF5_PATH, 'r')
# Getting 3D array of datasets.
print('Building array')
hdf5_arr = np.array([hdf5_file['col_1'], hdf5_file['col_2'], hdf5_file['col_3']])
# Getting top 300 rows by second column.
print('Getting top 300 values')
top_300_arr = hdf5_arr[np.argpartition(hdf5_arr, axis=1)]
# Printing top 300 rows.
print('Printing top 300 values')
with open(OUTFILE_PATH, 'a') as outfile:
    np.savetxt(outfile, top_300_arr, delimiter="\t", fmt='%s')
I've added print statements to monitor progress. Currently, my code prints "Building array" and then doesn't appear to progress for at least an hour. That would mean that my issue is with the line hdf5_arr = np.array([hdf5_file['col_1'], hdf5_file['col_2'], hdf5_file['col_3']]). Is there any way I can improve my code so that it runs in a reasonable time?
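For reference, here is a minimal sketch of one possible direction, not a confirmed fix. It assumes that "top" means the largest values in col_2, that col_2 fits in memory on its own, and that the slowdown comes from np.array walking the h5py datasets element by element rather than reading them in bulk. The helper names top_k_indices and write_top_rows are hypothetical. The idea is to read only the sort column with a bulk slice, pick the top indices with np.argpartition (which needs a kth argument), and then fetch just those rows from each dataset.

```python
import numpy as np


def top_k_indices(values, k):
    """Return indices of the k largest entries of a 1-D array, largest first."""
    values = np.asarray(values)
    k = min(k, values.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # argpartition runs in linear time; it only guarantees that the last k
    # positions hold the indices of the k largest values, in no set order.
    candidates = np.argpartition(values, -k)[-k:]
    # Sort just those k candidates, descending by value.
    return candidates[np.argsort(values[candidates])[::-1]]


def write_top_rows(hdf5_path, outfile_path, k=300, sort_key='col_2'):
    """Write the k rows with the largest `sort_key` values as TSV.

    Assumes the file holds 1-D datasets named col_1, col_2 and col_3.
    """
    import h5py  # imported lazily so top_k_indices works without h5py

    with h5py.File(hdf5_path, 'r') as hdf5_file:
        # dataset[:] reads the whole column in one bulk operation.
        sort_values = hdf5_file[sort_key][:]
        top = top_k_indices(sort_values, k)

        # h5py fancy indexing needs indices in increasing order, so read
        # with sorted indices and then restore the ranking order.
        read_order = np.sort(top)
        restore = np.searchsorted(read_order, top)
        columns = [hdf5_file[name][read_order][restore]
                   for name in ('col_1', 'col_2', 'col_3')]

    with open(outfile_path, 'w') as outfile:
        for row in zip(*columns):
            # String datasets may come back as bytes; decode those.
            fields = [f.decode() if isinstance(f, bytes) else str(f)
                      for f in row]
            outfile.write('\t'.join(fields) + '\n')
```

With the paths from the code above, this would be called as write_top_rows(HDF5_PATH, OUTFILE_PATH). Only one full column is held in memory at a time, and the names and the other value column are read for the selected 300 rows only.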