
TLDR: I am aware of this question, but it breaks the mentality of lazily evaluating dask collections: converting a lazily evaluated dask.dataframe to an array with .values is problematic memory-wise. So, how can I convert a dask dataframe to a scipy.sparse.csr_matrix without first converting the dataframe to an array?

Complete Problem: I have 3 dask dataframes: one with text features, one with numerical features, and one with categorical features. I am vectorizing the dataframe with text features using sklearn.feature_extraction.text.TfidfVectorizer, which returns a scipy.sparse.csr_matrix. I need to concatenate the 3 dataframes into one (horizontally), but they have different dtypes. I also tried dask_ml.feature_extraction.text.HashingVectorizer. It returns a lazily evaluated dask.array, but .compute() returns a scipy.sparse.csr_matrix. When I try to convert it to a dask.dataframe without calling .compute(), as below:

import dask.array as da
import dask.dataframe as dd
.
.
.
# Fitting the dask.dataframe; the result is a lazily evaluated dask.array
X = vectorizer.fit_transform(X)
print(X.compute())
X = dd.from_dask_array(X)
print(X.compute())

X = vectorizer.fit_transform(X) returns:

dask.array<_transform, shape=(nan, 1000), dtype=float64, chunksize=(nan, 1000), chunktype=numpy.ndarray>

First print(X.compute()) returns a csr_matrix:

  (0, 73)   2.0
  (0, 95)   3.0
  (0, 286)  1.0
  (0, 340)  2.0
  (0, 373)  3.0
  (0, 379)  3.0
  (0, 387)  1.0
  (0, 407)  2.0
  (0, 421)  1.0
  (0, 479)  1.0
  (0, 482)  3.0
  (0, 515)  1.0
  (0, 520)  1.0
  (0, 560)  4.0
  (0, 596)  1.0
  (0, 620)  4.0
  (0, 630)  1.0
  (0, 648)  2.0
  (0, 680)  1.0
  (0, 721)  1.0
  (0, 760)  3.0
  (0, 824)  4.0
  (0, 826)  12.0
  (0, 880)  2.0
  (0, 908)  1.0
  : :
  (10, 985) 1.0
  (11, 95)  3.0
  (11, 171) 4.0
  (11, 259) 4.0
  (11, 276) 1.0
  (11, 352) 3.0
  (11, 358) 1.0
  (11, 436) 1.0
  (11, 485) 1.0
  (11, 507) 3.0
  (11, 553) 1.0
  (11, 589) 1.0
  (11, 604) 1.0
  (11, 619) 3.0
  (11, 625) 2.0
  (11, 719) 1.0
  (11, 826) 6.0
  (11, 858) 2.0
  (11, 880) 3.0
  (11, 908) 1.0
  (11, 925) 2.0
  (11, 930) 4.0
  (11, 968) 1.0
  (11, 975) 1.0
  (11, 984) 4.0

Then X = dd.from_dask_array(X) returns:

Dask DataFrame Structure:
                   0        1        2        3        4        5        6        7        8        9        10       11       12       13       14       15       16       17       18       19       20       21       22       23  

Second print(X.compute()) returns the error:

ValueError: Shape of passed values is (6, 1), indices imply (6, 1000)

So I can't convert a csr_matrix to a dask.dataframe either.

UPDATE: I've just realized that calling .values on a dask.dataframe actually returns a lazily evaluated dask.array. This is still not what I want, but at least it does not materialize a full dataframe or array on my local machine.
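
For illustration, a minimal sketch of this behavior (the dataframe contents here are hypothetical):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

arr = ddf.values        # still a lazily evaluated dask.array
print(type(arr))        # dask.array.core.Array
print(arr.compute())    # only now is a concrete numpy array materialized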

MehmedB
  • I'm confused. Which way is the conversion going? The start says `dask to csr`, the end `csr to dask`. It seems that `vectorizer` is successfully creating a `csr` matrix. – hpaulj Dec 30 '19 at 17:52
  • @hpaulj I am sorry for the confusion. In the end, I tried to explain the actual reason for me trying to convert the dask dataframes to CSR matrices. And the actual reason is that there is no way to receive the vectorized data as a dataframe. The vectorizer only returns a CSR matrix. And I also can't convert the CSR matrix to the dask dataframe. I thought it would be easier to convert a dataframe to a sparse format. That's why the TLDR section asks that question. – MehmedB Dec 30 '19 at 18:06
  • `csr_matrix` shows inputs that you can use to make such a matrix. The simplest is to just provide a `numpy` array, from which it extracts the nonzero elements. You can provide the nonzero elements directly, usually in the form of 3 arrays (in effect the list of tuples that the `csr` printout shows); see the sketch below this thread. – hpaulj Dec 30 '19 at 18:18
  • Yes, you are right about the simplest way. But densifying the data is also costly. That's why I said: "I am aware of this question but it breaks the mentality of lazily evaluating the dask collections." And the question is: https://stackoverflow.com/questions/20459536/convert-pandas-dataframe-to-sparse-numpy-matrix-directly Currently, all of my computations run distributed across a cluster of machines. – MehmedB Dec 30 '19 at 18:24
  • @hpaulj do you think converting a CSR to dataframe would be easier in this case? – MehmedB Dec 30 '19 at 18:26
  • One way or other you have to specify all the nonzero elements of the matrix. The use of a sparse matrix format just allows you to skip the zeros. This sort of conversion makes the most sense when the proportion of nonzeros is 10% or less. – hpaulj Dec 30 '19 at 18:29
  • Well... No luck eh? Then I'm gonna wait for dask to implement sparse dataframes as pandas did. And I will wait for someone to come up with a magical solution. Thank you anyways. – MehmedB Dec 30 '19 at 18:31
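
As hpaulj notes above, a csr_matrix can be built directly from its nonzero entries without ever materializing a dense array. A minimal sketch with hypothetical toy values:

import numpy as np
from scipy.sparse import csr_matrix

# Three parallel arrays describe the nonzero entries, mirroring the
# (row, col) value triplets that the csr printout above shows.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 0, 2])
data = np.array([2.0, 1.0, 3.0, 4.0])

mat = csr_matrix((data, (rows, cols)), shape=(3, 4))
print(mat.toarray())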

1 Answer


The best way I found is to convert each block to a csr_matrix and then make it dense:

from scipy.sparse import csr_matrix

# Wrap each block as a csr_matrix, then densify it with .toarray()
# (which returns an ndarray; .todense() would return a np.matrix)
# so that dd.from_dask_array receives numpy chunks.
dd.from_dask_array(X.map_blocks(lambda x: csr_matrix(x).toarray()))
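
If densifying every block is too costly memory-wise, a possible alternative is to compute the chunks but keep them sparse, stacking them into a single csr_matrix. This is only a sketch, assuming each chunk of X really is a scipy.sparse.csr_matrix, as the first print(X.compute()) above suggests:

import dask
import scipy.sparse as sp

# One delayed object per chunk of the dask.array
chunks = X.to_delayed().flatten().tolist()
# Compute the chunks (each one a csr_matrix) and stack them sparsely
X_sparse = sp.vstack(dask.compute(*chunks))

Note that this does trigger computation, but the data stays sparse throughout.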
alwaysprep