
I am trying to convert a COO sparse matrix (from scipy.sparse) to a pandas sparse series. The documentation (http://pandas.pydata.org/pandas-docs/stable/sparse.html) says to use SparseSeries.from_coo(A). The conversion itself seems to work, but when I try to print the resulting series, this is what happens.

10x10 seems OK.

import pandas as pd 
import scipy.sparse as ss 
import numpy as np 
row = (np.random.random(10)*10).astype(int) 
col = (np.random.random(10)*10).astype(int) 
val = np.random.random(10)*10 
sparse = ss.coo_matrix((val,(row,col)),shape=(10,10)) 
pss = pd.SparseSeries.from_coo(sparse)
print pss
0  7    1.416631
   9    5.833902
1  0    4.131919
2  3    2.820531
   7    2.227009
3  1    9.205619
4  4    8.309077
6  0    4.376921
7  6    8.444013
   7    7.383886
dtype: float64
BlockIndex
Block locations: array([0])
Block lengths: array([10])

But not 100x100.

import pandas as pd 
import scipy.sparse as ss 
import numpy as np 
row = (np.random.random(100)*100).astype(int) 
col = (np.random.random(100)*100).astype(int) 
val = np.random.random(100)*100 
sparse = ss.coo_matrix((val,(row,col)),shape=(100,100)) 
pss = pd.SparseSeries.from_coo(sparse)
print pss

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-790-f0c22a601b93> in <module>()
      7 sparse = ss.coo_matrix((val,(row,col)),shape=(100,100))
      8 pss = pd.SparseSeries.from_coo(sparse)
----> 9 print pss
     10 

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\base.pyc in __str__(self)
     45         if compat.PY3:
     46             return self.__unicode__()
---> 47         return self.__bytes__()
     48 
     49     def __bytes__(self):

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\base.pyc in __bytes__(self)
     57 
     58         encoding = get_option("display.encoding")
---> 59         return self.__unicode__().encode(encoding, 'replace')
     60 
     61     def __repr__(self):

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\sparse\series.pyc in __unicode__(self)
    287     def __unicode__(self):
    288         # currently, unicode is same as repr...fixes infinite loop
--> 289         series_rep = Series.__unicode__(self)
    290         rep = '%s\n%s' % (series_rep, repr(self.sp_index))
    291         return rep

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
    895 
    896         self.to_string(buf=buf, name=self.name, dtype=self.dtype,
--> 897                        max_rows=max_rows)
    898         result = buf.getvalue()
    899 

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in to_string(self, buf, na_rep, float_format, header, length, dtype, name, max_rows)
    960         the_repr = self._get_repr(float_format=float_format, na_rep=na_rep,
    961                                   header=header, length=length, dtype=dtype,
--> 962                                   name=name, max_rows=max_rows)
    963 
    964         # catch contract violations

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\series.pyc in _get_repr(self, name, header, length, dtype, na_rep, float_format, max_rows)
    989                                         na_rep=na_rep,
    990                                         float_format=float_format,
--> 991                                         max_rows=max_rows)
    992         result = formatter.to_string()
    993 

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\format.pyc in __init__(self, series, buf, length, header, na_rep, name, float_format, dtype, max_rows)
    145         self.dtype = dtype
    146 
--> 147         self._chk_truncate()
    148 
    149     def _chk_truncate(self):

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\format.pyc in _chk_truncate(self)
    158             else:
    159                 row_num = max_rows // 2
--> 160                 series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
    161             self.tr_row_num = row_num
    162         self.tr_series = series

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    752                        keys=keys, levels=levels, names=names,
    753                        verify_integrity=verify_integrity,
--> 754                        copy=copy)
    755     return op.get_result()
    756 

C:\Users\ej\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\tools\merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    803         for obj in objs:
    804             if not isinstance(obj, NDFrame):
--> 805                 raise TypeError("cannot concatenate a non-NDFrame object")
    806 
    807             # consolidate

TypeError: cannot concatenate a non-NDFrame object

I don't really understand the error message. I think I am following the example in the documentation to the letter, just with my own COO matrix. Could the size be the problem?

Regards

hpaulj
Francesco
  • Yeah, looks OK to me at first glance. Maybe it is size related, as you speculate. Does it work on smaller matrices? – JohnE Aug 12 '15 at 18:27
  • Nope. See screenshot: http://imgur.com/X4d8cL5, unless you consider a 162x95 sparse matrix too large?! Do you think it could be a bug then? Thank you for your help. – Francesco Aug 13 '15 at 08:32
  • No, it's not that big. The best way to troubleshoot, or to prove it is a bug, is to post actual sample data so others can replicate it. – JohnE Aug 13 '15 at 12:07
  • @JohnE, thanks. OK, not sure where best to put the test code, but here it is: `import pandas as pd` `import scipy.sparse as ss` `import numpy as np` `row = (np.random.random(100)*100).astype(int)` `col = (np.random.random(100)*100).astype(int)` `val = np.random.random(100)*100` `sparse = ss.coo_matrix((val,(row,col)),shape=(100,100))` `pss = pd.SparseSeries.from_coo(sparse)` `pss` This gives me the same error. – Francesco Aug 13 '15 at 20:58
  • I have only dabbled with sparse matrices so I can't say what is going on. If you don't get any suggestions here on SO, you may want to raise an issue at github: https://github.com/pydata/pandas/issues – JohnE Aug 13 '15 at 21:39
  • Best thing is to put the code in the original question. I replicated the problem with your code whereas it seems to work fine for a 10x10 instead of 100x100. Ideally show both: how it works for 10x10 and not for 100x100. Actually, I'll go ahead and edit it in but please alter or add to it as you like. – JohnE Aug 13 '15 at 21:46
  • I think the way you are creating the matrix allows it to have overlapping entries -- e.g. 2 different values could be mapped to row 2, column 6. I doubt that is the problem but I suspect that is not really a good way to do it either. – JohnE Aug 13 '15 at 21:58
  • By default, the coo_matrix adds the values in `data` which have the same index position. This is actually a useful feature, particularly if you want to down-sample your data (you simply divide the `row` or `column` elements by your bin step). I am pretty sure this happens in my examples, so perhaps it's that... – Francesco Aug 13 '15 at 22:20
  • `coo_matrix()` does not actually sum duplicate values; it just stores those 3 input arrays in its attributes (without copy or change). The summation occurs when the matrix is converted to another format such as `csr`, or when it is displayed. It may be worth trying a `sparse=sparse.tocsr().tocoo()` round trip just to clean up any duplication. – hpaulj Dec 09 '15 at 23:04
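
The point about duplicates in the last comment can be checked on a tiny matrix. This is a minimal sketch, not part of the original thread: the three entries are made up for illustration, and two of them deliberately share position (2, 6).

```python
import numpy as np
import scipy.sparse as ss

# Illustrative data (an assumption, not from the question):
# two values land on the same (row, col) = (2, 6).
row = np.array([2, 2, 5])
col = np.array([6, 6, 1])
val = np.array([1.0, 3.0, 7.0])

coo = ss.coo_matrix((val, (row, col)), shape=(10, 10))
stored_before = coo.data.size      # the duplicate is still stored separately

# Round trip through CSR, as suggested in the comment above.
cleaned = coo.tocsr().tocoo()
stored_after = cleaned.data.size   # duplicates have been merged
summed_value = cleaned.toarray()[2, 6]

print(stored_before, stored_after, summed_value)
```

After the round trip, each (row, col) pair appears once, so a MultiIndex built from `cleaned.row` and `cleaned.col` has no repeated keys.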

1 Answer


I have an older pandas. It has the sparse code, but not the tocoo. The pandas issue that has been filed in connection with this is: https://github.com/pydata/pandas/issues/10818

But I found this on GitHub:

def _coo_to_sparse_series(A, dense_index=False):
    """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
    Use the defaults given in the SparseSeries constructor. """
    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
    s = s.to_sparse()  # TODO: specify kind?
    # ...
    return s

With a smallish sparse matrix I construct and display without problems:

In [259]: Asml=sparse.coo_matrix(np.arange(10*5).reshape(10,5))
In [260]: s=pd.Series(Asml.data,pd.MultiIndex.from_arrays((Asml.row,Asml.col)))
In [261]: s=s.sort_index()
In [262]: s
Out[262]: 
0  1     1
   2     2
   3     3
   4     4
1  0     5
   1     6
   2     7
 [...  mine]
   3    48
   4    49
dtype: int32
In [263]: ssml=s.to_sparse()
In [264]: ssml
Out[264]: 
0  1     1
   2     2
   3     3
   4     4
1  0     5
  [...  mine]
   2    47
   3    48
   4    49
dtype: int32
BlockIndex
Block locations: array([0])
Block lengths: array([49])

But with a larger array (more nonzero elements) I get a display error. I'm guessing it happens when the display for the (plain) series starts to use an ellipsis (...). I'm running in Py3, so I get a different error message.

....\pandas\core\base.pyc in __str__(self)
     45         if compat.PY3:
     46             return self.__unicode__()   # py3
     47         return self.__bytes__()         # py2 route

e.g.:

In [265]: Asml=sparse.coo_matrix(np.arange(10*7).reshape(10,7))
In [266]: s=pd.Series(Asml.data,pd.MultiIndex.from_arrays((Asml.row,Asml.col)))
In [267]: s=s.sort_index()
In [268]: s
Out[268]: 
0  1     1
   2     2
   3     3
   4     4
   5     5
   6     6
1  0     7
   1     8
   2     9
   3    10
   4    11
   5    12
   6    13
2  0    14
   1    15
...
7  6    55
8  0    56
   1    57
[... mine]
Length: 69, dtype: int32
In [269]: ssml=s.to_sparse()
In [270]: ssml
Out[270]: <repr(<pandas.sparse.series.SparseSeries at 0xaff6bc0c>)
failed: AttributeError: 'SparseArray' object has no attribute '_get_repr'>
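
The ellipsis guess is consistent with the two sizes tried here: 49 values display in full, 69 do not, and the pandas default for `display.max_rows` is 60. The sketch below only checks that threshold on plain (dense) series; the helper `repr_is_truncated` is a hypothetical name introduced for illustration, and the sketch does not reproduce the sparse failure itself.

```python
import numpy as np
import pandas as pd

def repr_is_truncated(series):
    """Return True if the series is longer than the display.max_rows option,
    which is the condition checked in _chk_truncate in the traceback above."""
    max_rows = pd.get_option("display.max_rows")
    return bool(max_rows) and len(series) > max_rows

short = pd.Series(np.arange(1, 50))   # 49 values, like the 10x5 example
long_ = pd.Series(np.arange(1, 70))   # 69 values, like the 10x7 example

print(repr_is_truncated(short), repr_is_truncated(long_))
```

If that is indeed the trigger, the 10x10 case in the question works because it has at most 10 stored values, while the 100x100 case has up to 100.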

I'm not sufficiently familiar with pandas code and structures to deduce much more for now.
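
For readers on a later pandas: `SparseSeries` was removed in pandas 1.0, and the conversion is exposed through the `Series.sparse` accessor instead. The following is a sketch under that assumption (it will not run on the 2015 version discussed here). The seeded generator and the `tocsr().tocoo()` clean-up step are my additions, not part of the question's code.

```python
import numpy as np
import pandas as pd
import scipy.sparse as ss

# Same shape and density as the failing 100x100 case, but seeded (an assumption).
rng = np.random.default_rng(0)
row = rng.integers(0, 100, size=100)
col = rng.integers(0, 100, size=100)
val = rng.random(100) * 100

coo = ss.coo_matrix((val, (row, col)), shape=(100, 100))
coo = coo.tocsr().tocoo()   # merge any duplicate (row, col) pairs first

# Accessor-based replacement for SparseSeries.from_coo in newer pandas.
pss = pd.Series.sparse.from_coo(coo)
text = repr(pss)            # building the repr is the step that failed in the question
print(text)
```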

hpaulj