
I'm writing a machine learning algorithm on huge, sparse data: my matrix has shape (347, 5 416 812 801) but is very sparse; only 0.13% of the entries are non-zero.

My sparse matrix takes 105 000 bytes (< 1 MB) and is in CSR format.

I'm trying to separate train/test sets by choosing a list of example indices for each, i.e. I want to split my dataset in two using:

training_set = matrix[train_indices]

of shape (len(train_indices), 5 416 812 801), still sparse

testing_set = matrix[test_indices]

of shape (347 - len(train_indices), 5 416 812 801), also sparse

where train_indices and test_indices are two lists of ints.
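
A minimal sketch of the setup (a small random matrix stands in for the real data here):

import numpy as np
from scipy import sparse

# Toy stand-in for the real (347, 5 416 812 801) matrix.
matrix = sparse.random(347, 10000, density=0.0013, format='csr')

perm = np.random.permutation(matrix.shape[0])
train_indices = perm[:300].tolist()   # rows used for training
test_indices = perm[300:].tolist()    # remaining rows for testing

training_set = matrix[train_indices]  # on the real data, this line segfaults
testing_set = matrix[test_indices]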

But training_set = matrix[train_indices] fails with Segmentation fault (core dumped).

It doesn't seem to be a memory problem, as I'm running this code on a server with 64 GB of RAM.

Any clue what could be causing this?

  • My guess is a MemoryError that isn't being trapped nicely. You may have to study `matrix.__getitem__` (the indexing method) to see how it does the selection. Each sparse format does its own indexing. `lil` and `csr` should handle row indexing well; `coo` doesn't handle indexing at all. Indexing sparse matrices isn't hidden in compiled code like it is for arrays (and it isn't as fast). – hpaulj Sep 14 '16 at 22:47
  • I'll check that, but as I'm using `csr` and trying to fetch rows, it should be fine – Doob Sep 14 '16 at 22:50
  • Which version of scipy are you using? You can check with `import scipy; print(scipy.__version__)` – Warren Weckesser Sep 15 '16 at 03:23
  • A search on SO or scipy github regarding 'sparse' and 'segmentation fault' might be in order. – hpaulj Sep 15 '16 at 15:59

1 Answer


I think I've recreated the csr row indexing with:

import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Build a (len(indices), N) selection matrix with a single 1 per row,
    # placed at column indices[i]; multiplying it by M picks out those rows.
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    shape = (len(indices), N)
    return sparse.csr_matrix((data, indices, indptr), shape=shape)

Testing on a csr I had hanging around:

In [185]: M
Out[185]: 
<30x40 sparse matrix of type '<class 'numpy.float64'>'
    with 76 stored elements in Compressed Sparse Row format>

In [186]: indices=np.r_[0:20]

In [187]: M[indices,:]
Out[187]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

In [188]: extractor(indices, M.shape[0])*M
Out[188]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

As with a number of other csr methods, this uses matrix multiplication to produce the final value: in this case, multiplication by a sparse selection matrix with a single 1 per row marking the rows to extract. The time is actually a bit better:

In [189]: timeit M[indices,:]
1000 loops, best of 3: 515 µs per loop
In [190]: timeit extractor(indices, M.shape[0])*M
1000 loops, best of 3: 399 µs per loop
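
For intuition, here is what a small extractor looks like in dense form (a sketch using the function above; `.A` densifies):

ex = extractor(np.array([2, 0, 3]), 5)
print(ex.A)
# [[ 0.  0.  1.  0.  0.]     row 0 of the product picks row 2 of M
#  [ 1.  0.  0.  0.  0.]     row 1 picks row 0
#  [ 0.  0.  0.  1.  0.]]    row 2 picks row 3

Multiplying this by any 5-row matrix returns its rows 2, 0 and 3, in that order.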

In your case the extractor matrix has shape (len(train_indices), 347), with only len(train_indices) non-zero values. So it is not big.

But if the matrix is so large (or at least its 2nd dimension is so big) that it triggers some error in the matrix multiplication routines, it could produce a segmentation fault without Python/NumPy trapping it.

Does matrix.sum(axis=1) work? That too uses matrix multiplication, though with a dense matrix of 1s. Or sparse.eye(347)*M, a similarly sized matrix multiplication?
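
In code, those two checks would look something like this (a sketch; matrix is the 347-row csr matrix from the question):

row_sums = matrix.sum(axis=1)                   # matmul with a dense column of 1s
eyed = sparse.eye(347, format='csr') * matrix   # (347, 347) * (347, 5416812801)
# If either of these also crashes, the failure is in the sparse matmul
# routines themselves, not just in the row-indexing code.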

  • Indeed, neither works: `sum` returns `IndexError: index 21870 out-of-bounds in add.reduceat [0, 20815)` and the matrix multiplication gives a segmentation fault. So is my only option to write my own slower but more memory-efficient code to slice the matrix? – Doob Sep 15 '16 at 12:50
  • While my first guess was a memory error, I now suspect it's the large number of columns, or rather the large value of some of the column indices. If it can't sum or do matrix multiplication, it will probably have problems in the learning code too. – hpaulj Sep 15 '16 at 15:54
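
Following up on that suspicion, one check worth running (an assumption, not confirmed in the thread) is whether the stored index arrays are int32, since the column dimension 5 416 812 801 is far above the int32 maximum:

import numpy as np

# Hypothetical diagnostic: int32 tops out at 2**31 - 1 = 2147483647,
# well below 5416812801, so int32 column indices would have overflowed.
print(np.iinfo(np.int32).max)   # 2147483647
print(matrix.indices.dtype)     # column indices of the csr matrix
print(matrix.indptr.dtype)      # row pointer array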