
I recently had this post, where I was assisted in building a big matrix from two smaller matrices. The resulting matrix is correct, and creating the multiplied numpy array takes under 10 minutes, but printing it to a file takes a very long time (over 7 hours). The final matrix is 108887x55482, and the file is 12 GB when complete.

Can anyone assist with the following code to print 'newmat' to a tab-delimited output file? I need the mat2 ids as the column headers, and the mat1 ids as the first field of each row.

#!/usr/bin/env python

import numpy as np

print '\n######################################################################################'
print 'Generating matrix.'
print '######################################################################################\n'


print "Opening files, creating lists of lists for numpy arrays.."

def open_files(path):
    with open(path, 'r') as f:
        ids = []
        vals = []
        next(f)  # skip the header line
        for line in f:
            fields = line.strip().split('\t')  # split once per line
            ids.append(fields[0])
            vals.append(fields[1:])
        print len(ids)
        print len(vals)
    return ids, vals

mat1ids, mat1vals = open_files('matrix1.txt')
mat2ids, mat2vals = open_files('matrix2.txt')

print 'Total Mat1: ' + str(len(mat1ids))
print 'Total Mat2: ' + str(len(mat2ids)), '\n'
print 'Generating arrays..'

mh = np.array(mat1vals, dtype=np.int8)  # int8 keeps the big matrices small in memory
mk = np.array(mat2vals, dtype=np.int8)

print 'Generating new matrix..'
newmat = mh.dot(mk.T)

print len(newmat)

print 'Printing results to outfile..'

with open('test_numpy_matrix.txt', 'w') as out:
    print >> out, '\t' + '\t'.join(mat2ids)
    for i in range(len(mat1ids)):
        print >> out, mat1ids[i] + '\t' + '\t'.join(str(x) for x in newmat[i])

print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'

Update: np.savetxt is taking just as long as looping through each element of the array. With np.savetxt I can put the mat2 ids in as column headers, but I can't add the mat1 ids as the first field of each row of the final matrix.
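
A possible workaround, sketched below and untested at this scale: stack the mat1 ids onto a string copy of the matrix and let np.savetxt write the header line itself. This assumes mat1ids/mat2ids are plain lists of strings, and note that astype(str) materializes the whole matrix as strings, which may defeat the int8 memory savings.

import numpy as np

# Sketch only: prepend the row IDs as a string column and hand the
# header to np.savetxt (comments='' stops the default '# ' prefix).
labeled = np.column_stack([np.array(mat1ids), newmat.astype(str)])
np.savetxt('test.out.txt', labeled, fmt='%s', delimiter='\t',
           header='\t'.join([''] + mat2ids), comments='')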

  • Is tab delimited output really needed? This is not a very efficient way to store large matrices. I would suggest dumping the resulting matrix with pytables into an HDF5 file, which also allows slicing if you later only need submatrices rather than the full matrix. – Bort Jun 07 '16 at 16:22
  • `np.savetxt` iterates over the rows of the array and does (more or less) `f.write(fmt % tuple(row))`, where `fmt` has a `%` format for each column plus delimiters. For 'normal' files that's sufficient. – hpaulj Jun 07 '16 at 16:46
  • @Bort - I would like the output to be in tab-delimited format; I have scripts that can quickly process the information I want to pull out of the file. Is hdf5 a type of database? – st.ph.n Jun 07 '16 at 17:02
  • @hpaulj, I see how np.savetxt allows printing the matrix in tab format; however, how can I get the column ids and row ids in, given that I have them in a separate list for each set? – st.ph.n Jun 07 '16 at 17:12
  • `np.savetxt('test.out.txt', newmat, delimiter='\t', fmt='%s')` gets me what I need, just without the row/col ids. – st.ph.n Jun 07 '16 at 17:16
  • Is it possible to get the row/col ids with np.savetxt with the matrix in dtype int8? int8 was used for memory purposes; otherwise, storing the whole matrix in memory (as strings in lists) gets the process killed. – st.ph.n Jun 07 '16 at 17:53
  • @user3358205 hdf5 is a file format, defined [here](https://www.hdfgroup.org/), and [pytables](http://www.pytables.org/) is a Python interface for it. It can be used as a database. For storing large data, especially matrices, it is highly efficient. It allows numpy-like syntax for accessing submatrices and supports user-defined data types (including platform-independent interpretation). – Bort Jun 07 '16 at 17:53
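
For reference, a minimal sketch of the HDF5 route suggested in the comments, using pytables; the file name and node names here are illustrative, not from the post:

import numpy as np
import tables

# Sketch only: dump the int8 matrix plus both ID lists into one HDF5 file.
with tables.open_file('matrix.h5', mode='w') as h5:
    h5.create_array(h5.root, 'newmat', newmat)
    h5.create_array(h5.root, 'row_ids', np.array(mat1ids))
    h5.create_array(h5.root, 'col_ids', np.array(mat2ids))

# Reading back a submatrix later does not load the whole file:
with tables.open_file('matrix.h5', mode='r') as h5:
    block = h5.root.newmat[:100, :100]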

2 Answers


For text output there's also the NumPy array tofile method. Here's a quick benchmark:

import numpy as np

data = np.random.randint(49, size=(55000))
f = open('test.txt', 'w')

print "original:"
%timeit f.write('\t'.join(str(x) for x in data))
print ".tofile text mode:"
%timeit data.tofile(f, '\t')

And the output:

original:
10 loops, best of 3: 192 ms per loop
.tofile text mode:
10 loops, best of 3: 27.2 ms per loop

So a nice little speedup. Then your loop would look something like this:

with open('test_numpy_matrix.txt', 'w') as out:
    print >> out, '\t' + '\t'.join(mat2ids)
    for i in range(len(mat1ids)):
        out.write(mat1ids[i] + '\t')
        newmat[i].tofile(out, '\t')
        out.write('\n')

On the other hand, a binary file format would probably be another order of magnitude faster, with a 3x smaller file size. Just try numpy.save on the full newmat array and see what kind of speed you get. Maybe store the row and column IDs in (a) separate file(s)?
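
For instance, a rough sketch of that binary route (the archive name and keys are illustrative): np.savez keeps the matrix and both ID lists together in one .npz archive.

import numpy as np

# Sketch: one .npz archive holding the int8 matrix and both ID lists.
np.savez('newmat.npz', newmat=newmat,
         row_ids=np.array(mat1ids), col_ids=np.array(mat2ids))

# Reload later; arrays are read on access by key, dtype preserved.
arch = np.load('newmat.npz')
m = arch['newmat']
rows = arch['row_ids']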


As I commented, savetxt iterates over the 'rows' of your array and writes each one, having passed it through a fmt % tuple(row) expression. This is still a line-by-line write; it's just another way of formatting the string.
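
In other words, roughly this (a simplified sketch of the inner loop, not savetxt's actual source):

fmt = '\t'.join(['%d'] * newmat.shape[1])   # one %-spec per column
with open('out.txt', 'w') as f:
    for row in newmat:
        f.write(fmt % tuple(row) + '\n')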

To write both a string label and numbers you have to create a structured array, one with a compound dtype.

Shape of a structured array in numpy

has an example of doing this with one label field and one data field. This could be generalized to several data fields.
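
For illustration, a small sketch of that generalization with made-up names (only practical for a handful of columns, as noted next):

import numpy as np

# One label field plus one named field per data column.
dt = np.dtype([('id', 'S5'), ('c0', int), ('c1', int), ('c2', int)])
A = np.zeros(2, dtype=dt)
A['id'] = ['id1', 'id2']
A['c0'], A['c1'], A['c2'] = [0, 3], [1, 4], [2, 5]

# savetxt applies the single fmt string to each record's field tuple.
np.savetxt('small.txt', A, fmt='%s\t%d\t%d\t%d')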

But with 55482 columns this approach is not very practical. It may be possible, but I'm having trouble imagining this as being very efficient:

'%s, %d, %d ... %d'%('id1',1,2,3,....55481)

Rather than put each column in a named data field, I could create a dtype with a multi-item data field:

In [171]: ids=['id1','id2','id3']
In [172]: data=np.arange(12).reshape(3,4)

In [173]: dt=np.dtype([('id','S5'),('data',int,(4,))])
In [174]: A=np.zeros((3,),dtype=dt)
In [175]: A['id']=ids
In [176]: A['data']=data

In [177]: A
Out[177]: 
array([(b'id1', [0, 1, 2, 3]), (b'id2', [4, 5, 6, 7]),
       (b'id3', [8, 9, 10, 11])], 
      dtype=[('id', 'S5'), ('data', '<i4', (4,))])

But I don't see how this can be formatted by savetxt.

e.g. this is not satisfactory:

In [180]: for row in A:
    print('%s %s'%tuple(row))
   .....:     
b'id1' [0 1 2 3]
b'id2' [4 5 6 7]
b'id3' [ 8  9 10 11]

If you insist on writing text, then you need to format your array as text, either row by row or in some multiline way.

I haven't played with tofile enough to know whether it does the row-by-row save any better or faster.

=========================

In [205]: data=np.arange(12).reshape(3,4)

In [206]: np.savetxt('test.txt',data,fmt='%5d',delimiter=',')

In [207]: cat test.txt
    0,    1,    2,    3
    4,    5,    6,    7
    8,    9,   10,   11

row by row tofile can produce the same thing:

with open('test.txt','w') as f:
    for row in data:
        row.tofile(f,format='%5d',sep=',')
        f.write('\n')

but savetxt is faster:

In [212]: %%timeit
with open('test.txt','w') as f:
    for row in data:
        row.tofile(f,format='%5d',sep=', ')
        f.write('\n')
   .....: 
1000 loops, best of 3: 341 µs per loop

In [213]: timeit np.savetxt('test.txt',data,fmt='%5d',delimiter=',')
10000 loops, best of 3: 175 µs per loop

Using the join format:

In [233]: %%timeit
   .....: with open('test.txt','w') as f:
    for row in data:
        f.write('%s\n'%(','.join('%5d'%i for i in row)))
   .....: 
1000 loops, best of 3: 217 µs per loop