extracting data from a sparse matrix

Asked Jan 04 '22 at 21:47

Active Jan 04 '22 at 21:53

Viewed 380 times

Consider the simple example below:

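from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups(subset="train", shuffle=True, random_state=42,
                          categories=["alt.atheism"])

vec = TfidfVectorizer(min_df=3, max_df=0.5, ngram_range=(2, 3))
X = vec.fit_transform(data.data)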

<480x17622 sparse matrix of type '<class 'numpy.float64'>'
    with 111502 stored elements in Compressed Sparse Row format>

I am using scikit-learn to represent textual data as a sparse matrix. I know I can get the column names of the sparse matrix using:

list(vec.get_feature_names_out())
['01 lines',
 '023044 19580',
 '023044 19580 ultb',
 '041343 24997',...]

I also know I can sum the values for each feature (column) by calling .sum() on the sparse matrix directly:

X.sum(axis = 0)
matrix([[0.77497472, 0.19175863, 0.19175863, ..., 0.29521438, 0.15458728,
         0.15458728]])

The issue is that this operation returns a numpy matrix, whereas I need a plain list of floats, like:

[0.77497472, 0.19175863, 0.19175863, ..., 0.29521438, 0.15458728, 0.15458728]

What is the proper way to extract that list? Thanks!

edited Jan 04 '22 at 21:53

ℕʘʘḆḽḘ

Might `list(X.toarray().sum(axis=0))` help? – amiola Jan 04 '22 at 22:00

`X.sum(axis=0).A1`. `A1` is a shortcut for returning a 1d array from a (1,n) `np.matrix` object. – hpaulj Jan 04 '22 at 22:06
@hpaulj this is nice! where is the documentation for `A1`? can you post this as an answer perhaps? – ℕʘʘḆḽḘ Jan 04 '22 at 23:20
See my answer to https://stackoverflow.com/questions/3337301/numpy-matrix-to-array. – hpaulj Jan 04 '22 at 23:38
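
A minimal sketch of the `A1` suggestion from the comments, for anyone landing here. It uses a tiny hand-made `csr_matrix` as a stand-in for the TF-IDF matrix `X` (the values are made up for illustration, and the names `col_sums`, `as_array`, `as_list`, and `alt_list` are hypothetical). It assumes `X` is a `scipy.sparse` matrix class such as `csr_matrix`, whose `.sum(axis=0)` returns an `np.matrix` as shown in the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small stand-in for the TF-IDF matrix X (illustrative values only).
X = csr_matrix(np.array([[0.5, 0.0, 0.25],
                         [0.0, 1.0, 0.25]]))

col_sums = X.sum(axis=0)      # np.matrix of shape (1, 3)
as_array = col_sums.A1        # flattened 1-D np.ndarray
as_list = as_array.tolist()   # plain Python list of floats

# Equivalent route that does not rely on the np.matrix-only attribute A1.
alt_list = np.asarray(col_sums).ravel().tolist()

print(as_list)
```

Calling `.tolist()` rather than `list(...)` converts the elements to built-in Python floats instead of `np.float64` scalars, which is usually what "a simple list of floats" calls for. Unlike `X.toarray().sum(axis=0)`, summing the sparse matrix first avoids building the full dense array.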