58

I am creating a matrix from a Pandas dataframe as follows:

dense_matrix = np.array(df.to_numpy(), dtype=bool).astype(int)

I then convert it into a sparse matrix with:

sparse_matrix = scipy.sparse.csr_matrix(dense_matrix)

Is there any way to go from a df straight to a sparse matrix?

Thanks in advance.

tashuhka
user7289

3 Answers

74

df.values is already a NumPy array, so there is no need to wrap it in np.array; accessing the values this way is typically faster.

scipy.sparse.csr_matrix(df.values)

You might need to take the transpose first, like df.values.T. In a DataFrame the rows (index) are axis 0 and the columns are axis 1, so without the transpose the matrix has one row per DataFrame row.
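A minimal sketch of this dense route, assuming a small all-numeric DataFrame (the frame and its column names are made up for illustration). It uses df.to_numpy(), the currently recommended spelling of df.values, and reproduces the bool-then-int cast from the question:

```python
import pandas as pd
from scipy import sparse

# Hypothetical example frame; any all-numeric DataFrame works the same way.
df = pd.DataFrame({"a": [0, 2, 0], "b": [0, 0, 5], "c": [1, 0, 0]})

# Dense route: DataFrame -> NumPy array -> CSR matrix.
# astype(bool).astype(int) mirrors the question's cast: non-zero becomes 1.
dense = df.to_numpy().astype(bool).astype(int)
sparse_matrix = sparse.csr_matrix(dense)

print(sparse_matrix.shape)  # (3, 3): one row per DataFrame row, one column per DataFrame column
print(sparse_matrix.nnz)    # number of stored non-zero entries
```

The intermediate array is still fully dense, so this only helps when the DataFrame itself fits comfortably in memory.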

Dan Allan
  • 1
    But this is supposed to generate a memory copy, isn't it? df.values essentially returns a dense matrix, which is then cast to csr_matrix. It doesn't work for large matrices. – Jake0x32 Jun 27 '15 at 03:44
  • No, if I understand correctly `df.values` does not make a copy. – Dan Allan Jul 02 '15 at 19:05
  • 4
    Another way would be to do e.g. df.replace(0, np.nan).to_sparse(), which results in a sparse DataFrame though, not a scipy.sparse.csr_matrix ... – ntg Apr 22 '16 at 03:06
  • 2
    df.values creates a dense matrix if df is a SparseDataFrame. Impractical for large datasets. – Stan Nov 15 '16 at 16:26
  • If you want to convert nans to sparse, then you have to do fillna first and then convert. – TheRajVJain Aug 29 '17 at 07:45
  • 2
    @Stan any solution in case of very large dataset ? – SarahData Aug 29 '18 at 11:07
  • And btw. OP asked for a 'direct' solution. You are converting dataframe to numpy array and then csr_matrix. You are literally densifying the dataframe, creating 'object's by converting Nan's inside a dataframe. Am I missing something here? Why is this an accepted answer? I don't understand. – MehmedB Dec 31 '19 at 07:42
  • Now I guess I understand. Since df.values doesn't return a copy, this is actually a direct conversion? – MehmedB Feb 01 '20 at 11:17
  • This does not work because `df.values` is returning a regular numpy matrix. – Jiang Xiang Sep 23 '21 at 18:54
  • See solution below please. – G. Cohen Jan 07 '22 at 15:40
4

There is a way to do it without converting to dense en route: csr_sparse_matrix = df.sparse.to_coo().tocsr()
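A hedged sketch of that one-liner in context. The .sparse accessor only works when every column has a sparse dtype, so this example first casts a made-up dense frame with pd.SparseDtype; the "float64" dtype and fill value of 0 are assumptions about the data:

```python
import pandas as pd

# Hypothetical dense frame used only for illustration.
df = pd.DataFrame({"a": [0.0, 2.0, 0.0], "b": [0.0, 0.0, 5.0]})

# Cast every column to a sparse dtype; zeros become the implicit fill value.
sparse_df = df.astype(pd.SparseDtype("float64", 0))

# COO is what pandas exports; convert it to CSR afterwards.
csr_sparse_matrix = sparse_df.sparse.to_coo().tocsr()

print(csr_sparse_matrix.shape)  # (3, 2)
```

If the DataFrame was built with sparse columns from the start, the astype step can be skipped and no dense copy is created along the way.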

G. Cohen
  • I got this error: `AttributeError: Can only use the '.sparse' accessor with Sparse data.` I think pandas does not allow to run it directly. – nomad culture Jan 06 '22 at 14:13
  • 3
    `df` has to be a sparse data frame. Convert a dense data frame to a sparse one via: `sparse_df = df.astype(pd.SparseDtype("float64", 0))` – G. Cohen Jan 06 '22 at 22:50
2

Solution:

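import pandas as pd
from scipy.sparse import csr_matrix

# Cast to a sparse dtype first (fill value 0), then build the CSR matrix from the COO export.
# The result gets its own name so it does not shadow the csr_matrix constructor.
sparse_matrix = csr_matrix(df.astype(pd.SparseDtype("float64", 0)).sparse.to_coo())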

Explanation:

to_coo needs the pd.DataFrame to be in a sparse format, so the dataframe will need to be converted to a sparse datatype: df.astype(pd.SparseDtype("float64",0))

After it is converted to a COO matrix, it can be converted to a CSR matrix.
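The two steps can be wrapped in a small function. This is only a sketch: df_to_csr is a hypothetical helper name, and the "float64" dtype with fill value 0 is an assumption carried over from the snippet above. It skips the cast when every column is already sparse, then checks the result against the dense array:

```python
import numpy as np
import pandas as pd
from scipy import sparse


def df_to_csr(df: pd.DataFrame) -> sparse.csr_matrix:
    """Hypothetical helper: convert a numeric DataFrame to CSR via the sparse accessor."""
    # Cast only if some column is not already sparse.
    if not all(isinstance(dtype, pd.SparseDtype) for dtype in df.dtypes):
        df = df.astype(pd.SparseDtype("float64", 0))
    # pandas exports COO; scipy converts COO to CSR.
    return df.sparse.to_coo().tocsr()


# Made-up input: a 4x4 identity matrix as a dense DataFrame.
df = pd.DataFrame(np.eye(4))
result = df_to_csr(df)

# Sanity check: the sparse result holds the same values as the dense frame.
assert (result.toarray() == df.to_numpy()).all()
print(result.nnz)  # 4 stored entries, one per diagonal element
```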

Rodalm