Consider the simple example below:
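from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups(subset="train", shuffle=True, random_state=42, categories=["alt.atheism"])
vec = TfidfVectorizer(min_df=3, max_df=0.5, ngram_range=(2, 3))
X = vec.fit_transform(data.data)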
<480x17622 sparse matrix of type '<class 'numpy.float64'>'
with 111502 stored elements in Compressed Sparse Row format>
I am using scikit-learn to represent textual data as sparse matrices. I know I can get the column names of the sparse matrix using:
list(vec.get_feature_names_out())
['01 lines',
'023044 19580',
'023044 19580 ultb',
'041343 24997',...]
I also know I can sum the TF-IDF weights of each n-gram across all documents by calling .sum() on the sparse matrix directly:
X.sum(axis=0)
matrix([[0.77497472, 0.19175863, 0.19175863, ..., 0.29521438, 0.15458728,
0.15458728]])
The issue is that this operation returns a numpy.matrix of shape (1, n_features), whereas I need a plain list of floats, like:
[0.77497472, 0.19175863, 0.19175863, ..., 0.29521438, 0.15458728, 0.15458728]
What is the proper way to extract that list? Thanks!
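
For reference, here is a minimal sketch of one conversion that seems to do what I want. To keep it self-contained, it uses a small hand-built csr_matrix as a stand-in for the real TF-IDF matrix; that toy matrix and the names col_sums and weights are my own assumptions, not part of the example above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the TF-IDF matrix X: 3 documents, 4 features.
X = csr_matrix(np.array([
    [0.5, 0.0, 0.25, 0.0],
    [0.0, 1.0, 0.25, 0.0],
    [0.5, 0.0, 0.0, 2.0],
]))

# Column sums come back as a numpy.matrix of shape (1, 4).
col_sums = X.sum(axis=0)

# Convert to a regular ndarray, flatten to 1-D, then to a list of Python floats.
weights = np.asarray(col_sums).ravel().tolist()
print(weights)
```

As far as I can tell, col_sums.A1.tolist() gives the same result, since .A1 is the numpy.matrix attribute that returns a flattened ndarray. I am not sure whether either of these counts as the proper way.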