11

I've tried to initialize csc_matrix and csr_matrix from a list of (data, (rows, cols)) values as the documentation suggests.

sparse = csc_matrix((data, (rows, cols)), shape=(n, n))

The problem is that, the method that I actually have for generating the data, rows and cols vectors introduces duplicates for some points. By default, scipy adds the values of the duplicate entries. However, in my case, those duplicates have exactly the same value in data for a given (row, col).

What I'm trying to achieve is to make scipy ignore the second entry if already exists one, instead of adding them.

Ignoring the fact that I could improve the generation algorithm to avoid generating duplicates, is there a parameter or another way of creating a sparse matrix that ignores duplicates?

Currently two entries with data = [4, 4]; cols = [1, 1]; rows = [1, 1]; generate a sparse matrix which value at (1,1) is 8 while the desired value is 4.

>>> c = csc_matrix(([4, 4], ([1,1],[1,1])), shape=(3,3))
>>> c.todense()
matrix([[0, 0, 0],
        [0, 8, 0],
        [0, 0, 0]])

I'm also aware that I could filter them by using a 2-dimensional numpy unique function, but lists are quite large so this is not really a valid option.

Other possible answer to the question: Is there any way of specifying what to do with duplicates? i.e. keeping the min or max instead of the default sum?

Imanol Luengo
  • 15,366
  • 2
  • 49
  • 67
  • I am pretty sure the answer is no, there is no built-in way of changing the behavior for duplicates. You shouldn't be too quick on discarding the use of `np.unique` though: no matter how large your lists are, scipy is going to convert them to arrays and do similar operations under the hood, so there is no reason why you shouldn't attempt to. – Jaime Feb 23 '15 at 16:53
  • 1
    `np.unique` is 1d, so handling these 2d coordinates will require some extra effort. – hpaulj Feb 23 '15 at 20:51
  • 1
    True, but tricks like [this](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array) make it 2D. – Imanol Luengo Feb 23 '15 at 21:02

3 Answers3

9

Creating an intermediary dok matrix works in your example:

In [410]: c=sparse.coo_matrix((data, (cols, rows)),shape=(3,3)).todok().tocsc()

In [411]: c.A
Out[411]: 
array([[0, 0, 0],
       [0, 4, 0],
       [0, 0, 0]], dtype=int32)

A coo matrix puts your input arrays into its data,col,row attributes without change. The summing doesn't occur until it is converted to a csc.

todok loads the dictionary directly from the coo attributes. It creates the blank dok matrix, and fills it with:

dok.update(izip(izip(self.row,self.col),self.data))

So if there are duplicate (row,col) values, it's the last one that remains. This uses the standard Python dictionary hashing to find the unique keys.


Here's a way of using np.unique. I had to construct a special object array, because unique operates on 1d, and we have a 2d indexing.

In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1],[0,1,1,1,2],[0,1,2,1,1]]]

In [480]: x=np.zeros(cols.shape,dtype=object)

In [481]: x[:]=list(zip(rows,cols))

In [482]: x
Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)

In [483]: i=np.unique(x,return_index=True)[1]

In [484]: i
Out[484]: array([0, 1, 4, 2], dtype=int32)

In [485]: c1=sparse.csc_matrix((data[i],(cols[i],rows[i])),shape=(3,3))

In [486]: c1.A
Out[486]: 
array([[1, 0, 0],
       [0, 4, 2],
       [0, 1, 0]], dtype=int32)

I have no idea which approach is faster.


An alternative way of getting the unique index, as per liuengo's link:

rc = np.vstack([rows,cols]).T.copy()
dt = rc.dtype.descr * 2
i = np.unique(rc.view(dt), return_index=True)[1]

rc has to own its own data in order to change the dtype with view, hence the .T.copy().

In [554]: rc.view(dt)
Out[554]: 
array([[(0, 0)],
       [(1, 1)],
       [(2, 1)],
       [(1, 1)],
       [(1, 2)]], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Looks like a nice trick. I can't try it right now, does it take a lot of time/memory the conversion between types? – Imanol Luengo Feb 23 '15 at 21:06
  • 3
    todok() doesn't ignoring duplicates at scipy 0.19 – WeiChing 林煒清 Sep 13 '17 at 03:22
  • 1
    Yes, `coo.todok` now includes a `self.sum_duplicates()` line. The solution is to do the update as I describe, but without this `sum`. – hpaulj Sep 13 '17 at 05:08
  • @hpaulj what solution? you mean using `todok()` function code without the line ` self.sum_duplicates()` ? – SarahData Sep 03 '18 at 14:13
  • Seems like `i=np.unique(x,return_index=True)[1]` should be `i=np.unique(np.unique(x,return_index=True))` instead to yield the output of `[0, 1, 2, 4]`. – qwerty Apr 13 '23 at 17:30
4

Since the values in your data at repeating (row, col) are the same, you can get the unique rows, columns and values as follows:

rows, cols, data = zip(*set(zip(rows, cols, data)))

Example:

data = [4, 3, 4]
cols = [1, 2, 1]
rows = [1, 3, 1]

csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()

matrix([[0, 0, 0, 0],
        [0, 8, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 3, 0]])



rows, cols, data = zip(*set(zip(rows, cols, data)))
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()

matrix([[0, 0, 0, 0],
        [0, 4, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 3, 0]])
Andreas K.
  • 9,282
  • 3
  • 40
  • 45
2

Just to update hpaulj's answer to the most recent version of SciPy, the simplest solution to this problem is now, given a COO matrix c now:

dok=sparse.dok_matrix((c.shape),dtype=c.dtype)
dok._update(zip(zip(c.row,c.col),c.data))

new_c = dok.tocsc()

This is due to a new wrapper in the dok update() function, preventing it from direct changes to the array, requiring the use of the underscore to bypass the wrapper.

Daniel
  • 31
  • 3