
I'm developing tooling based on pandas DataFrame objects. I would like to keep a scipy sparse matrix around as a column of a DataFrame without converting it row-wise to a list / numpy array of dtype('O').

The snippet below doesn't work: pandas treats the matrix as a scalar and suggests adding an index. When I provide a pd.RangeIndex over the row indices of the matrix, the whole matrix gets repeated for every row in the DataFrame (again because pandas thinks it is a scalar).

ma = scipy.sparse.rand(10, 100, 0.1, 'csr', dtype=np.float64)
df = pd.DataFrame(dict(X=ma))

This does work:

df = pd.DataFrame(dict(X=list(ma)))

However, this cuts the matrix up row-wise into CSR matrices of 1 row each, which I would then need to vstack every time I want to work on the original matrix.
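To make that cost concrete, here is a rough sketch of the round trip using scipy only. The small matrix and the names `rows` and `stacked` are made up for illustration; rows are taken by explicit indexing rather than `list(ma)`, on the assumption that iteration behaviour may differ between scipy versions.

```python
import numpy as np
import scipy.sparse as sp

# Small illustrative matrix (assumed data, not from the original post).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
ma = sp.csr_matrix(dense)

# Splitting: one 1 x n CSR matrix per row of the original.
rows = [ma[i] for i in range(ma.shape[0])]

# Reassembling: vstack builds a new matrix, so the data is copied.
stacked = sp.vstack(rows, format="csr")
```

The reassembled matrix holds the same values, but it is a fresh object rather than a view of the original.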

Any pointers? I tried wrapping the CSR matrix into a pd.Series object, pretending it has dtype('O'), but I run into a lot of assumptions on the underlying data being numpy arrays and such.

Frens Jan

2 Answers


pandas has a sparse DataFrame / Series feature (SparseDataFrame). It is still experimental. I've answered a few SO questions about converting back and forth between that and scipy sparse matrices.

From the sidebar:

Populate a Pandas SparseDataFrame from a SciPy Sparse Coo Matrix

Without such a specialized pandas structure I don't see how a sparse matrix could be added to a pandas frame. The internal structure of a sparse matrix is too different. For a start it is not a subclass of numpy array.
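As a hedged sketch of what such a specialized structure looks like: the code below uses the `DataFrame.sparse` accessor, which is an assumption about the installed pandas version (it is a newer API than the SparseDataFrame class mentioned above). Note that it maps each matrix column to its own frame column, so it does not store the whole matrix in a single column as the question asks. The example data is made up.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Small illustrative matrix (assumed data, not from the original post).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
ma = sp.csr_matrix(dense)

# Each matrix column becomes one sparse-backed DataFrame column.
sdf = pd.DataFrame.sparse.from_spmatrix(ma)

# Converting back yields a COO matrix; tocsr() restores the CSR format.
back = sdf.sparse.to_coo().tocsr()
```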

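A csr matrix is an object with its data contained in 3 arrays. ma.data and ma.indices are 1d arrays with one value for each non-zero element of the matrix (the value and its column index). ma.indptr has one entry per row plus one: row i occupies the slice indptr[i]:indptr[i+1] of the other two arrays.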
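A small illustration of those three arrays, using a made-up 3 x 3 matrix (the data and the name `row2_values` are only for this sketch):

```python
import numpy as np
import scipy.sparse as sp

# Small illustrative matrix (assumed data, not from the original post).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
ma = sp.csr_matrix(dense)

print(ma.data)     # non-zero values, stored row by row
print(ma.indices)  # column index of each stored value
print(ma.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Values of the last row, recovered from the raw arrays.
row2_values = ma.data[ma.indptr[2]:ma.indptr[3]]
```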

list(ma) is meaningless. ma.toarray() produces a 2d array with the same data, and with all those zeros filled in as well.

Other sparse matrix formats store their data in other structures: 3 equal-length arrays for coo, two lists of lists for lil, and a dictionary for dok.
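A quick way to inspect those structures is to convert a small matrix into each format. The data below is made up for illustration, and the attribute names are the ones scipy exposes on these formats:

```python
import numpy as np
import scipy.sparse as sp

# Small illustrative matrix (assumed data, not from the original post).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
ma = sp.csr_matrix(dense)

# coo: three arrays of equal length (row index, column index, value).
coo = ma.tocoo()
coo_lengths = (len(coo.row), len(coo.col), len(coo.data))

# lil: one list of column indices and one list of values per row.
lil = ma.tolil()
lil_rows = lil.rows.tolist()
lil_data = lil.data.tolist()

# dok: dictionary-like lookup keyed by (row, column).
dok = ma.todok()
dok_value = dok[1, 2]
```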

hpaulj
  • Thanks @hpaulj for the reply. I'm actually not interested in accessing the SciPy matrix column by column through Pandas, only row-wise or the matrix as a whole. I understand the mismatch between ndarray and the sparse matrix types on a memory layout level. I was hoping there was an abstraction where both could fit in ... `list(ma)` is not entirely meaningless, at least for a CSR matrix it creates CSR matrices of one row for each row in the original CSR matrix. – Frens Jan Sep 12 '16 at 10:23
  • I don't see how a matrix (2d), dense or sparse, can be used as a column of a Dataframe. I believe `pandas` will try to map the columns of a 2d array onto an equal number of columns. Cells can also be object pointers, but not whole column. But I know numpy and scipy better than I know pandas. – hpaulj Sep 12 '16 at 16:44
  • Thanks again. I understand. It would be a convenient place to store a CSR matrix of feature vectors where every row corresponds to e.g. a label in another column of the DataFrame, while being able to access the underlying CSR matrix _without_ copying or stacking it back together. Can't have everything I guess :) – Frens Jan Sep 12 '16 at 18:57

Admittedly, this doesn't answer your question exactly, but if anyone is looking for a quick workaround, and doesn't mind storing the matrix inefficiently as dense, this can be done as:

df = pd.DataFrame(dict(X=ma.todense().tolist()))
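A minimal, self-contained sketch of this dense workaround and the way back to a sparse matrix. The example data and the name `restored` are made up, and `toarray()` is used here because it returns a plain ndarray:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Small illustrative matrix (assumed data, not from the original post).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 6.0]])
ma = sp.csr_matrix(dense)

# One DataFrame row per matrix row; each cell holds a Python list of floats.
df = pd.DataFrame(dict(X=ma.toarray().tolist()))

# Rebuilding the sparse matrix means re-densifying the column first.
restored = sp.csr_matrix(np.array(df["X"].tolist()))
```

Both directions materialize every zero, so this only suits matrices small enough to hold densely in memory.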
dimid