I am learning feature extraction from text documents and found this tutorial. I could not understand what np.asarray(doc_counts.sum(axis=0)).ravel(), in the third line from the end, is returning. I checked it and it returned a list of numbers. I guess it is the term frequency, but I am not sure.
I also do not understand what lambda idx: -1 * idx[1] is doing, the multiplication by -1 in particular. I checked whether zip() has an .idx attribute for accessing elements, but could not find one.
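To isolate the second expression, here is a minimal sketch that assumes made-up (word, count) pairs in place of the real zip() output. The pairs list is hypothetical and not from the tutorial. It suggests that idx is just the name of the lambda's parameter, not something that belongs to zip(), and that it receives one whole tuple at a time.

```python
# Hypothetical (word, count) pairs, shaped like the tuples zip() would yield
pairs = [("cat", 2), ("dog", 1), ("the", 3)]

# The lambda's parameter is named idx, but it receives a whole (word, count) tuple.
# idx[1] is the count; negating it makes sorted() put the largest counts first,
# because sorted() orders by ascending key.
by_negated_count = sorted(pairs, key=lambda idx: -1 * idx[1])

# Same ordering spelled without the negation
by_reverse = sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(by_negated_count)  # [('the', 3), ('cat', 2), ('dog', 1)]
print(by_reverse)        # [('the', 3), ('cat', 2), ('dog', 1)]
```

If that reading is right, the -1 is only a trick for sorting in descending order, and reverse=True would be the more readable way to get the same result.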
Code from the tutorial:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
docs = <load your docs as an iterable>
count_vect = CountVectorizer()
doc_counts = count_vect.fit_transform(docs)
word_counts = zip(count_vect.get_feature_names_out(), np.asarray(doc_counts.sum(axis=0)).ravel())
word_counts = sorted(word_counts, key=lambda idx: -1 * idx[1] )
# Display top 100 words by frequency
word_counts[:100]
Could someone please explain these two lines?
Thanks in advance.