6

I am saving a sparse array (converted to a dense numpy array) to a CSV file, and the result is a 3 GB file. The problem is that 95% of the cells are 0.0000, because I used fmt='%5.4f'. How can I format and save the array so that zeros are written as just 0 and the nonzero floats are written with the '%5.4f' format? I am sure I can get the 3 GB down to about 300 MB if I can do this.

I am using

np.savetxt('foo.csv', arrayDense, fmt='%5.4f', delimiter = ',')

Thanks Regards

CT Zhu
Run2
  • Using a different, non-dense storage format would likely produce better results. See http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format for how to do that. – user2357112 Jul 11 '14 at 07:16
  • Also, consider compressing it. `savetxt` and `loadtxt` automatically use gzip if the filename ends in `.gz`; that might be an easy way to shrink your file. – user2357112 Jul 11 '14 at 07:19
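
The gzip suggestion in the comment above can be tried with a small stand-in array. This is only a sketch: `array_dense` and the temporary file names are made up for the example, and the actual ratio depends on the data.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the asker's array: mostly zeros.
array_dense = np.zeros((200, 50))
array_dense[::10, 3] = 1.2345

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "foo.csv")
gz_path = os.path.join(tmpdir, "foo.csv.gz")

np.savetxt(plain_path, array_dense, fmt="%5.4f", delimiter=",")
# The '.gz' suffix makes savetxt write through gzip.
np.savetxt(gz_path, array_dense, fmt="%5.4f", delimiter=",")

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(plain_size, gz_size)

# loadtxt decompresses transparently for the same suffix.
round_trip = np.loadtxt(gz_path, delimiter=",")
```

Long runs of identical `0.0000` fields compress well, but the consumer of the file has to be able to read gzip.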

3 Answers

9

If you look at the source code of np.savetxt, you'll see that, while there is quite a bit of code to handle the arguments and the differences between Python 2 and Python 3, it is ultimately a simple Python loop over the rows, in which each row is formatted and written to the file. So you won't lose any performance by writing your own. For example, here's a pared down function that writes compact zeros:

def savetxt_compact(fname, x, fmt="%.6g", delimiter=','):
    with open(fname, 'w') as fh:
        for row in x:
            line = delimiter.join("0" if value == 0 else fmt % value for value in row)
            fh.write(line + '\n')

For example:

In [70]: x
Out[70]: 
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.2345    ],
       [ 0.        ,  9.87654321,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  3.14159265,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [71]: savetxt_compact('foo.csv', x, fmt='%.4f')

In [72]: !cat foo.csv
0,0,0,0,1.2345
0,9.8765,0,0,0
0,3.1416,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
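
To get a feel for the size difference, here is a rough check on made-up data with about 95% zeros, as in the question. The array `x`, the seed and the temporary paths are assumptions for illustration; `savetxt_compact` is repeated so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np


def savetxt_compact(fname, x, fmt="%.6g", delimiter=","):
    # Same function as above, repeated so this snippet is self-contained.
    with open(fname, "w") as fh:
        for row in x:
            line = delimiter.join("0" if value == 0 else fmt % value for value in row)
            fh.write(line + "\n")


# Hypothetical test data: roughly 95% of the cells are zero.
rng = np.random.default_rng(0)
x = rng.random((500, 100))
x[rng.random(x.shape) < 0.95] = 0.0

tmpdir = tempfile.mkdtemp()
full_path = os.path.join(tmpdir, "full.csv")
compact_path = os.path.join(tmpdir, "compact.csv")

np.savetxt(full_path, x, fmt="%5.4f", delimiter=",")
savetxt_compact(compact_path, x, fmt="%.4f")

full_size = os.path.getsize(full_path)
compact_size = os.path.getsize(compact_path)
print(full_size, compact_size)
```

Each zero cell shrinks from `0.0000` to `0`, so the saving grows with the fraction of zeros.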

Then, as long as you are writing your own savetxt function, you might as well make it handle sparse matrices, so you don't have to convert the matrix to a (dense) numpy array before saving it. (I assume the sparse array is implemented using one of the sparse representations from scipy.sparse.) In the following function, the only change is from `... for value in row` to `... for value in row.A[0]`.

def savetxt_sparse_compact(fname, x, fmt="%.6g", delimiter=','):
    with open(fname, 'w') as fh:
        for row in x:
            line = delimiter.join("0" if value == 0 else fmt % value for value in row.A[0])
            fh.write(line + '\n')

Example:

In [112]: a
Out[112]: 
<6x5 sparse matrix of type '<type 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

In [113]: a.A
Out[113]: 
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  1.2345    ],
       [ 0.        ,  9.87654321,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  3.14159265,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [114]: savetxt_sparse_compact('foo.csv', a, fmt='%.4f')

In [115]: !cat foo.csv
0,0,0,0,1.2345
0,9.8765,0,0,0
0,3.1416,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
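
The sparse version above still densifies each row through `row.A[0]`. As a rough sketch, assuming the input can be converted to CSR and has no duplicate stored entries, the per-row slices of `indptr`, `indices` and `data` can be used directly instead. The name `savetxt_csr_compact` is made up for this example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def savetxt_csr_compact(fname, x, fmt="%.6g", delimiter=","):
    """Write a sparse matrix as a full CSV grid with compact zeros."""
    # Assumption: no duplicate (row, col) entries are stored in x.
    x = sparse.csr_matrix(x)
    n_cols = x.shape[1]
    with open(fname, "w") as fh:
        for i in range(x.shape[0]):
            start, stop = x.indptr[i], x.indptr[i + 1]
            fields = ["0"] * n_cols
            for j, value in zip(x.indices[start:stop], x.data[start:stop]):
                if value != 0:  # explicitly stored zeros stay "0"
                    fields[j] = fmt % value
            fh.write(delimiter.join(fields) + "\n")


# Same values as the example matrix above.
dense = np.zeros((6, 5))
dense[0, 4] = 1.2345
dense[1, 1] = 9.87654321
dense[2, 1] = 3.14159265
a = sparse.csr_matrix(dense)

path = os.path.join(tempfile.mkdtemp(), "foo.csv")
savetxt_csr_compact(path, a, fmt="%.4f")
with open(path) as fh:
    text = fh.read()
print(text)
```

Only the stored entries go through the format string; every other field is the literal `0`.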
Warren Weckesser
  • Thanks a lot Warren. This will definitely work. My sparse matrix was a result of a transform on a TfidfVectorizer model. It returns a float64 double dimension sparse array like below (say I am considering 10 top terms) \n`(0, 9) 0.434529124115 (0, 8) 0.506103404485 (0, 6) 0.342163203439 (0, 5) 0.114195114018 (0, 4) 0.228240906166 (0, 0) 0.506863556372 (1, 9) 0.179650406184 (1, 8) 0.650974675792 (1, 5) 0.385568606136 (1, 3) 0.0601214405201 (1, 2) 0.117613972075 (1, 1) 0.34801600856 (1, 0) 0.27164684163 ...` . Btw the 0.4g fmt works too by itself. – Run2 Jul 13 '14 at 18:18
5

Another simple option that may work given your requirements is the 'g' specifier. If you care more about significant digits and less about seeing exactly x number of digits and don't mind it switching between scientific and float, this does the trick well. For example:

np.savetxt("foo.csv", arrayDense, fmt='%5.4g', delimiter=',') 

If arrayDense is this:

matrix([[ -5.54900000e-01,   0.00000000e+00,   0.00000000e+00],
    [  0.00000000e+00,   3.43560000e-08,   0.00000000e+00],
    [  0.00000000e+00,   0.00000000e+00,   3.43422000e+01]])

Your way would yield:

-0.5549,0.0000,0.0000
0.0000,0.0000,0.0000
0.0000,0.0000,34.3422

The above would yield instead:

-0.5549,    0,    0
    0,3.436e-08,    0
    0,    0,34.34

This way is also more flexible. Notice that by using 'g' instead of 'f' you don't lose small values (e.g. 3.4356e-08 is written as 3.436e-08 instead of 0.0000). How many digits survive still depends on the precision you set, however.
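
One detail worth checking: the width `5` in `%5.4g` pads short fields with spaces, so a zero takes five characters. Dropping the width (`%.4g`, the format the asker reports using in the comments) writes a single `0`. A small sketch with the same values, using a plain array instead of a matrix and an in-memory buffer instead of a file:

```python
import io

import numpy as np

array_dense = np.array([
    [-0.5549, 0.0, 0.0],
    [0.0, 3.4356e-08, 0.0],
    [0.0, 0.0, 34.3422],
])


def render(fmt):
    # Write to an in-memory text buffer so the output can be inspected.
    buf = io.StringIO()
    np.savetxt(buf, array_dense, fmt=fmt, delimiter=",")
    return buf.getvalue()


padded = render("%5.4g")    # zeros are padded to width 5
unpadded = render("%.4g")   # zeros are a single character
print(unpadded)
```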

Eric
2

It would be much better to save only the nonzero entries of your sparse matrix (m in the example below). You can achieve that with:

fname = 'row_col_data.txt'
m = m.tocoo()
a = np.vstack((m.row, m.col, m.data)).T
header = '{0}, {1}'.format(*m.shape)
np.savetxt(fname, a, header=header, fmt=('%d', '%d', '%5.4f'))

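The sparse matrix can then be rebuilt with:

from scipy.sparse import coo_matrix
row, col, data = np.loadtxt(fname, skiprows=1, unpack=True)
with open(fname) as fh:  # first line is the '# nrows, ncols' header
    shape = tuple(int(n) for n in fh.readline()[1:].split(','))
m = coo_matrix((data, (row.astype(int), col.astype(int))), shape=shape)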
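
Putting the two halves together, a round trip might look like the sketch below. The helper names `save_coo_txt` and `load_coo_txt` are hypothetical, and the loader assumes the file holds at least one stored entry.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix


def save_coo_txt(fname, m, value_fmt="%.6g"):
    """Write 'row col value' triplets, with the shape in a '#' header line."""
    m = m.tocoo()
    triplets = np.vstack((m.row, m.col, m.data)).T
    header = "{0}, {1}".format(*m.shape)
    np.savetxt(fname, triplets, header=header, fmt=("%d", "%d", value_fmt))


def load_coo_txt(fname):
    """Rebuild the sparse matrix written by save_coo_txt."""
    with open(fname) as fh:
        shape = tuple(int(n) for n in fh.readline().lstrip("#").split(","))
    # ndmin=2 keeps the (n, 3) layout even with a single stored entry.
    triplets = np.loadtxt(fname, skiprows=1, ndmin=2)
    row = triplets[:, 0].astype(int)
    col = triplets[:, 1].astype(int)
    return coo_matrix((triplets[:, 2], (row, col)), shape=shape)


dense = np.zeros((6, 5))
dense[0, 4] = 1.2345
dense[1, 1] = 9.87654321
dense[2, 1] = 3.14159265

path = os.path.join(tempfile.mkdtemp(), "row_col_data.txt")
save_coo_txt(path, coo_matrix(dense))
restored = load_coo_txt(path).toarray()
print(restored)
```

If a consumer needs the full n-by-m grid, `toarray()` on the loaded matrix gives the dense array back, zeros included.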
Saullo G. P. Castro
  • Hi Castro - thanks for that reply. I learnt a lot from that. But, the thing is I need the csv in (n,m) rows columns format and all the m columns. This is because I need to load it in WEKA and SMOTE it. Your approach is creating an xls , in (n1,2) rows columns format and is also missing the 0 values. – Run2 Jul 11 '14 at 09:24
  • CT Zhu had answered correctly - but for some reason the post is deleted. I cannot choose it as the correct answer. Only using `fmt='%.4g'` while saving solved it. I will add an answer if CT Zhu does not add that post back again in some days. – Run2 Jul 11 '14 at 09:33
  • @Run2 the `0` values are not missing, the thing is that they are not stored in a sparse matrix, that's the main purpose to use this type of matrix so I believe you don't have to worry with the `0` values... if you need a dense array you can do `m.toarray()`, where you can see the zeros... – Saullo G. P. Castro Jul 11 '14 at 10:25
  • @Run2 if you use the `g` formatter, be aware that the number of significant digits is given by the precision in `%.numg`, such that if you have `1.12345` you need `%.6g` and if you have `111.12345` you need `%.8g` to keep all 5 decimal digits – Saullo G. P. Castro Jul 11 '14 at 10:50
  • @Run2: Also note that the `%.4g` format will convert to scientific notation with very large or very small numbers (e.g. `1.2345e-12`). If that is acceptable, or if your numbers are in a range such that this does not happen, then CT Zhu's deleted answer certainly looks like the simplest solution. – Warren Weckesser Jul 11 '14 at 14:19
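
The points about the `g` format in the last two comments can be checked with plain string formatting. In printf-style formats the number before the `.` is the minimum field width and the number after it is the precision, which for `g` counts significant digits. The sample values below are chosen only for illustration.

```python
# Width only: the default precision of 6 significant digits still applies.
width_only = "%6g" % 111.12345

# Precision 8 keeps all eight significant digits of this value.
eight_digits = "%.8g" % 111.12345
six_digits = "%.6g" % 1.12345

# With precision 4, very small or large magnitudes switch to scientific notation.
tiny = "%.4g" % 1.5e-12
large = "%.4g" % 123456.0
zero = "%.4g" % 0.0

print(width_only, eight_digits, six_digits, tiny, large, zero)
```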