Consider the simple example below:

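from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups(subset="train", shuffle=True, random_state=42,
                          categories=["alt.atheism"])

# TF-IDF weights over bigrams and trigrams
vec = TfidfVectorizer(min_df=3, max_df=0.5, ngram_range=(2, 3))
X = vec.fit_transform(data.data)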

<480x17622 sparse matrix of type '<class 'numpy.float64'>'
    with 111502 stored elements in Compressed Sparse Row format>

I am using scikit-learn to represent textual data as sparse matrices. I know I can get the column names of the sparse matrix using:

list(vec.get_feature_names_out())
['01 lines',
 '023044 19580',
 '023044 19580 ultb',
 '041343 24997',...]

and I know I can sum the TF-IDF weights of each n-gram (one per column) by calling .sum() on the sparse matrix directly:

X.sum(axis = 0)
matrix([[0.77497472, 0.19175863, 0.19175863, ..., 0.29521438, 0.15458728,
         0.15458728]])

The issue is that this operation returns a 1 x N matrix, whereas I need a simple list of floats, like:

[0.77497472, 0.19175863, 0.19175863, ..., 0.29521438, 0.15458728, 0.15458728]

What is the proper way to extract that list? Thanks!
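One possible approach, sketched here rather than offered as the definitive answer: convert the result of `.sum(axis=0)` to a plain NumPy array, flatten it to one dimension, and call `.tolist()`. To keep the snippet self-contained, it uses a small hand-made matrix `X_toy` in place of the real `X`, and the helper name `column_sums_as_list` is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the TF-IDF matrix X (hypothetical values, not the real data).
X_toy = csr_matrix(np.array([[0.5, 0.0, 0.25],
                             [0.25, 0.5, 0.0]]))


def column_sums_as_list(sparse_matrix):
    """Return the per-column sums of a SciPy sparse matrix as a list of floats."""
    # For a csr_matrix, .sum(axis=0) gives a 1 x n_columns np.matrix;
    # np.asarray(...).ravel() flattens it to a 1-D array before .tolist().
    return np.asarray(sparse_matrix.sum(axis=0)).ravel().tolist()


col_sums = column_sums_as_list(X_toy)
print(col_sums)  # [0.75, 0.5, 0.25]
```

Applied to the question's data, this would presumably be `column_sums_as_list(X)`, and `dict(zip(vec.get_feature_names_out(), column_sums_as_list(X)))` would pair each n-gram with its summed weight. `X.sum(axis=0).A1.tolist()` is a shorter alternative, but `.A1` exists only on `np.matrix` objects, so the `np.asarray` form is the safer choice if the sum ever comes back as a plain array.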

ℕʘʘḆḽḘ
