
I need to calculate TF/IDF for all possible n-grams of a corpus (the corpus is not big and can be processed on a local machine), using Python 2.7. Is there a reference implementation or library I can use directly? Thanks.

regards, Lin

Lin Ma
  • Try [gensim](https://radimrehurek.com/gensim/) – m9_psy Jun 13 '16 at 00:25
  • @m9_psy, thanks and vote up. Do you mean using this API (https://radimrehurek.com/gensim/models/tfidfmodel.html)? I'm not sure whether gensim requires a dictionary in advance; my requirement is that I do not have a dictionary, I just need to calculate TF/IDF for all possible n-grams. Your advice is appreciated. :) – Lin Ma Jun 13 '16 at 00:37
  • No, you do not need the dictionary - it will be built in the process. For detecting phrases (n-grams) there is a separate module: https://radimrehurek.com/gensim/models/phrases.html – m9_psy Jun 13 '16 at 00:39
  • Thanks @m9_psy, from the sample you referred me to, does it only work for bigrams? `bigram = Phrases(sentence_stream)`? – Lin Ma Jun 13 '16 at 22:29
  • No, it can handle phrases of any length, and the docs I referred to contain the exact recipe for this. – m9_psy Jun 14 '16 at 00:42
  • Thanks @m9_psy, vote up for your reply. I am referring to this page (https://radimrehurek.com/gensim/models/phrases.html), and it says the module is used to find "frequently co-occurring tokens". My need is to find TF/IDF values and identify n-grams with high TF/IDF values, and "frequently co-occurring tokens" does not seem to be TF/IDF. Wondering about your comments; please feel free to correct me. :) – Lin Ma Jun 14 '16 at 23:19
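
A minimal sketch of the gensim route discussed in these comments, assuming gensim is installed; the sample sentences, `min_count`, and `threshold` values are illustrative only. `Phrases` merges frequently co-occurring tokens, and `TfidfModel` then supplies the TF/IDF weights, with the dictionary built from the corpus itself:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.phrases import Phrases

# Illustrative, already-tokenized corpus (one list of tokens per document).
sentence_stream = [
    ["new", "york", "is", "a", "big", "city"],
    ["new", "york", "has", "a", "big", "harbor"],
    ["tokyo", "is", "a", "big", "city", "too"],
]

# Merge frequently co-occurring tokens into single tokens (e.g. "new_york").
# Applying Phrases again to an already-phrased stream extends bigrams to longer phrases.
bigram = Phrases(sentence_stream, min_count=1, threshold=1)
phrased = [bigram[sentence] for sentence in sentence_stream]

# The dictionary is built on the fly -- no predefined word list is needed.
dictionary = Dictionary(phrased)
bow_corpus = [dictionary.doc2bow(doc) for doc in phrased]

tfidf = TfidfModel(bow_corpus)
for doc in tfidf[bow_corpus]:
    print([(dictionary[term_id], round(weight, 3)) for term_id, weight in doc])
```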

1 Answer


scikit-learn solves this issue.

http://scikit-learn.org/stable/modules/feature_extraction.html
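
As a rough illustration of what the linked page covers (the documents below are made up; `ngram_range=(1, 4)` covers the 1- to 4-grams asked about in the comments):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The input is just the raw documents -- no dictionary is needed up front.
corpus = [
    u"this is the first document",
    u"this document is the second document",
    u"and this is the third one",
]

# ngram_range=(1, 4) extracts unigrams through 4-grams as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
tfidf_matrix = vectorizer.fit_transform(corpus)   # rows: documents, columns: n-grams

print(vectorizer.get_feature_names()[:10])        # the extracted n-grams
print(tfidf_matrix.shape)                         # (n_documents, n_ngrams)
```

(In newer scikit-learn releases, `get_feature_names` has been replaced by `get_feature_names_out`.)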

dmitryro
  • Thanks user3358074, vote up for your reply. To use the reference you pointed out from scikit-learn, do I need to have a dictionary in advance? My requirement is that I do not have a dictionary; I just need to calculate TF/IDF for all possible n-grams. Your advice is appreciated. :) – Lin Ma Jun 13 '16 at 00:34
  • This will probably just require your corpus, and then the rest is like here: http://stackoverflow.com/questions/23792781/tf-idf-feature-weights-using-sklearn-feature-extraction-text-tfidfvectorizer – dmitryro Jun 13 '16 at 00:47
  • Thanks dmitryro, vote up for your reply. Just to confirm my understanding is correct: (1) by `corpus`, do you mean the raw documents/files I have, for which I want to generate TF/IDF of n-grams, rather than a word dictionary file? (2) Do you know whether scikit-learn also works with Unicode text such as Chinese and Japanese characters (assuming Unicode-encoded input)? – Lin Ma Jun 13 '16 at 22:33
  • BTW, another question: (3) I do not see where to specify the n-gram parameter n; currently I want n to be no more than 4 (i.e., I want to consider 1-grams, 2-grams, 3-grams and 4-grams for their TF/IDF values). Thanks. – Lin Ma Jun 13 '16 at 22:35
  • Thanks dmitryro, I read the link you referred to (http://scikit-learn.org/stable/modules/feature_extraction.html), but I cannot find an example of where to input the corpus (I cannot find one in the other Stack Overflow post you referred to either). If you could provide more information, it would be great. – Lin Ma Jun 17 '16 at 18:36
  • Thanks for all the help, dmitryro; I'll mark your reply as the answer. :) – Lin Ma Jul 27 '16 at 23:19
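
A hedged follow-up to this thread: once the TF/IDF matrix is built as in the sketch under the answer, pulling out each document's highest-weighted n-grams is just a matter of sorting a row of the matrix. The documents and variable names below are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    u"this is the first document",
    u"this document is the second document",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 4))
tfidf_matrix = vectorizer.fit_transform(corpus)
terms = np.array(vectorizer.get_feature_names())

# For each document, list its top n-grams from highest to lowest TF/IDF weight.
for row in tfidf_matrix.toarray():
    order = np.argsort(row)[::-1]
    top = [(terms[i], round(float(row[i]), 3)) for i in order if row[i] > 0][:5]
    print(top)
```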