TLDR: I am aware of this question, but converting a lazily evaluated `dask.dataframe` to an array with `.values` breaks the lazy-evaluation model and is problematic memory-wise. So, how can I convert a dask dataframe to a `scipy.sparse.csr_matrix` without first converting the dataframe to an array?
Complete Problem:
I have 3 dask dataframes: one with text features, one with numerical features, and one with categorical features. I am vectorizing the text-feature dataframe using `sklearn.feature_extraction.text.TfidfVectorizer`, which returns a `scipy.sparse.csr_matrix`. I need to concatenate the 3 dataframes into one (horizontally), but they have different `dtype`s. I also tried `dask_ml.feature_extraction.text.HashingVectorizer`. It returns a lazily evaluated `dask.array`, but its `.compute()` returns a `scipy.sparse.csr_matrix`. Without `.compute()`, when I try to convert it to a `dask.dataframe` as below:
import dask.array as da
import dask.dataframe as dd
...
# Fit the dask.dataframe; the result is a lazily evaluated dask.array
X = vectorizer.fit_transform(X)
print(X.compute())
X = dd.from_dask_array(X)
print(X.compute())
`X = vectorizer.fit_transform(X)` returns:
dask.array<_transform, shape=(nan, 1000), dtype=float64, chunksize=(nan, 1000), chunktype=numpy.ndarray>
The first `print(X.compute())` returns a csr_matrix:
(0, 73) 2.0
(0, 95) 3.0
(0, 286) 1.0
(0, 340) 2.0
(0, 373) 3.0
(0, 379) 3.0
(0, 387) 1.0
(0, 407) 2.0
(0, 421) 1.0
(0, 479) 1.0
(0, 482) 3.0
(0, 515) 1.0
(0, 520) 1.0
(0, 560) 4.0
(0, 596) 1.0
(0, 620) 4.0
(0, 630) 1.0
(0, 648) 2.0
(0, 680) 1.0
(0, 721) 1.0
(0, 760) 3.0
(0, 824) 4.0
(0, 826) 12.0
(0, 880) 2.0
(0, 908) 1.0
: :
(10, 985) 1.0
(11, 95) 3.0
(11, 171) 4.0
(11, 259) 4.0
(11, 276) 1.0
(11, 352) 3.0
(11, 358) 1.0
(11, 436) 1.0
(11, 485) 1.0
(11, 507) 3.0
(11, 553) 1.0
(11, 589) 1.0
(11, 604) 1.0
(11, 619) 3.0
(11, 625) 2.0
(11, 719) 1.0
(11, 826) 6.0
(11, 858) 2.0
(11, 880) 3.0
(11, 908) 1.0
(11, 925) 2.0
(11, 930) 4.0
(11, 968) 1.0
(11, 975) 1.0
(11, 984) 4.0
and `X = dd.from_dask_array(X)` returns:
Dask DataFrame Structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
The second `print(X.compute())` raises the error:
ValueError: Shape of passed values is (6, 1), indices imply (6, 1000)
So I can't convert a csr_matrix to a `dask.dataframe` either.
UPDATE: I've just realized that using `.values` on a `dask.dataframe` actually returns a lazily evaluated `dask.array`. This is still not what I want, but at least it is not materializing a full dataframe or array on my local machine.