
I am learning feature extraction from text documents and found this tutorial. I could not understand what `np.asarray(doc_counts.sum(axis=0)).ravel()`, in the line that builds `word_counts` with `zip()`, returns. I checked, and it returned a list of numbers. I guess these are term frequencies, but I am not sure.

Also, what does `lambda idx: -1 * idx[1]` do, and why multiply by -1 in particular? I looked for an `.idx` attribute on the object returned by `zip()` to access an element, but could not find one.

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = <load your docs as an iterable>

count_vect = CountVectorizer()
doc_counts = count_vect.fit_transform(docs)  

word_counts = zip(count_vect.get_feature_names(), np.asarray(doc_counts.sum(axis=0)).ravel())
word_counts = sorted(word_counts, key=lambda idx: -1 * idx[1] )

# Display top 100 words by frequency
word_counts[:100]

Could someone please explain these two lines?

Thanks in advance.

Om Prakash
  • Each item `idx` in `word_counts` is sorted *as if* it were `-1 * idx[1]`. – timgeb Sep 21 '17 at 19:55
  • I really don't understand the motive behind down-voting this question. Is it because clicking the **down-vote button** is easier than answering the question, even if it makes a beginner run away from SO? – Om Prakash Sep 21 '17 at 20:01
  • @OmPrakash The downvotes are because you are cluttering up this site with questions that are easily answered by a simple google search. ["what does lambda do in sorted python"](https://stackoverflow.com/questions/8966538/syntax-behind-sortedkey-lambda) – o-90 Sep 21 '17 at 20:02
  • Your question is too vague. For example, it is not clear whether you don't understand the concept of lambda functions at all (in which case a downvote for not reading the documentation would be in order), whether you don't know why this specific lambda function is applied in the code you provided, or whether you don't understand how the specific lambda function sorts your input. In any case, a downvote for an unclear and probably poorly researched question is warranted. We don't want more questions like this on the site; that's why we downvote instead of answering. – timgeb Sep 21 '17 at 20:03
  • @gobrewers14, my apologies if the question was not clear. The question is not about Python's `lambda` or `zip()`, but about how the code works internally. – Om Prakash Sep 21 '17 at 20:05

1 Answer


This specifies that the key used as the basis for sorting is `-1 * idx[1]`. Each element of `word_counts` is a tuple containing a word followed by its frequency, and `idx` is just the lambda's parameter name for one such element (it is not an attribute of `zip()`). So `idx[1]` accesses the frequency, which is used as the basis for sorting the list. I think the reason he multiplies it by -1 is that `sorted()` sorts in ascending order by default: if you negate a list of positive numbers and sort them in ascending order, you get the original numbers in descending order, which is exactly what you want here, a list of words in descending order of frequency.
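Here is a minimal sketch of that behaviour. The `(word, count)` pairs are made up for illustration and stand in for the zipped `word_counts` from the question. It also shows that, as far as I know, passing `reverse=True` to `sorted()` gives the same order without the negation trick.

```python
# Hypothetical (word, count) pairs standing in for the zipped word_counts.
word_counts = [("apple", 3), ("banana", 7), ("cherry", 1)]

# sorted() is ascending by default, so negating the count puts the
# largest count first. `idx` is only the lambda's parameter name.
by_negated_key = sorted(word_counts, key=lambda idx: -1 * idx[1])

# Same ordering, expressed with the reverse flag instead of negation.
by_reverse = sorted(word_counts, key=lambda idx: idx[1], reverse=True)

print(by_negated_key)  # [('banana', 7), ('apple', 3), ('cherry', 1)]
```

The negation only works because the counts are numbers; `reverse=True` is the more general way to ask for descending order.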

You can read more about using lambda and key in sorted on this page.
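As for the other line in the question: `doc_counts` holds one row per document and one column per word, so `sum(axis=0)` adds up each column, giving the total count of every word across all documents. On the sparse matrix that `CountVectorizer` returns, that sum comes back as a 1-row matrix, and `np.asarray(...).ravel()` just flattens it into a plain 1-D array so it can be zipped with the feature names. A rough plain-Python sketch of the same idea, using a small made-up count matrix instead of scikit-learn output:

```python
# Hypothetical vocabulary and counts: rows are documents, columns are words.
feature_names = ["apple", "banana", "cherry"]
doc_counts = [
    [1, 2, 0],  # counts of each word in document 0
    [2, 5, 1],  # counts of each word in document 1
]

# Column-wise totals: the plain-Python analogue of doc_counts.sum(axis=0).
totals = [sum(column) for column in zip(*doc_counts)]

# Pair every word with its total count, as the zip() line in the question does.
word_counts = list(zip(feature_names, totals))

print(word_counts)  # [('apple', 3), ('banana', 7), ('cherry', 1)]
```

So the numbers you saw are raw counts of each word over the whole corpus, not normalized frequencies.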

Gambit1614