In a recent post, I got help building a large matrix from two smaller matrices. The resulting matrix is correct, and creating the multiplied numpy array takes less than 10 minutes. However, printing it to a file takes a very long time (more than 7 hours). The final matrix is 108887x55482, and the finished file is 12 GB.
Can anyone help me speed up the following code, which prints 'newmat' to a tab-delimited output file? I need the mat2 ids as the column headers and the mat1 ids as row[0], the first field of each row.
#!/usr/bin/env python
import numpy as np
print '\n######################################################################################'
print 'Generating matrix.'
print '########################################################################################\n'
print "Opening files, creating lists of lists for numpy arrays.."
def open_files(file):
    # Read a tab-delimited file: first field is the id, the rest are values.
    with open(file, 'r') as f:
        ids = []
        vals = []
        next(f)  # skip the header line
        for line in f:
            fields = line.strip().split('\t')
            ids.append(fields[0])
            vals.append(fields[1:])
    print len(ids)
    print len(vals)
    return ids, vals
mat1ids, mat1vals = open_files('matrix1.txt')
mat2ids, mat2vals = open_files('matrix2.txt')
print 'Total Mat1: ' + str(len(mat1ids))
print 'Total Mat2: ' + str(len(mat2ids)), '\n'
print 'Generating arrays..'
mh = np.int8(mat1vals)
mk = np.int8(mat2vals)
print 'Generating new matrix..'
newmat = mh.dot(mk.T)
print len(newmat)
print 'Printing results to outfile..'
with open('test_numpy_matrix.txt', 'w') as out:
    print >> out, '\t', '\t'.join(mat2ids)
    for i in range(len(mat1ids)):
        print >> out, mat1ids[i], '\t', '\t'.join(str(x) for x in newmat[i])
print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'
Update: np.savetxt takes just as long as looping through each element in the array. With np.savetxt I can write the mat2 ids as column headers, but I cannot add the mat1 ids as row[0] in the final matrix.
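To make the target layout concrete, here is a minimal sketch of one approach that might be worth trying. It writes the header once and then writes the rows in chunks, with each row's id placed in front of its values. The function name write_labeled_matrix, the chunk_rows parameter, and the demo file name are hypothetical and only for illustration. For a numpy array, each chunk is converted with tolist() so the values become plain Python ints before formatting. I have not timed this on the full 108887x55482 matrix, so any speedup is an assumption. The demo uses a small nested list so that it runs without numpy; an array such as newmat would be passed the same way.

```python
def write_labeled_matrix(path, row_ids, col_ids, matrix, chunk_rows=1000):
    """Write `matrix` as tab-delimited text with column headers and row ids.

    Hypothetical helper: `matrix` may be a 2-D numpy array or a list of rows.
    """
    with open(path, 'w') as out:
        # Header line: an empty corner cell, then the column ids.
        out.write('\t' + '\t'.join(col_ids) + '\n')
        for start in range(0, len(matrix), chunk_rows):
            block = matrix[start:start + chunk_rows]
            # A numpy array converts to nested lists of Python ints in one call.
            if hasattr(block, 'tolist'):
                block = block.tolist()
            # Build every line of this chunk, then write them in a single call.
            lines = [
                row_ids[start + offset] + '\t' + '\t'.join(map(str, row))
                for offset, row in enumerate(block)
            ]
            out.write('\n'.join(lines) + '\n')


# Tiny demo with made-up ids and values (not taken from the real data).
write_labeled_matrix(
    'demo_matrix.txt',
    ['r1', 'r2'],
    ['c1', 'c2', 'c3'],
    [[1, 2, 3], [4, 5, 6]],
    chunk_rows=1,
)
```

In the script above, the call would be write_labeled_matrix('test_numpy_matrix.txt', mat1ids, mat2ids, newmat). A larger chunk_rows means fewer write calls but more memory held as strings at once, so the value would need tuning.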