
I need to calculate TF/IDF for all possible n-grams of a corpus (the corpus is not big and can be processed on a local machine), using Python 2.7. Is there a reference implementation or library I can use directly? Thanks.

regards, Lin

Lin Ma
  • Try [gensim](https://radimrehurek.com/gensim/) – m9_psy Jun 13 '16 at 00:25
  • @m9_psy, thanks and vote up. Do you mean using this API (https://radimrehurek.com/gensim/models/tfidfmodel.html)? I'm not sure whether gensim requires a dictionary in advance; my requirement is that I do not have a dictionary, I just need to calculate TF/IDF for all possible n-grams. Your advice is appreciated. :) – Lin Ma Jun 13 '16 at 00:37
  • No, you do not need the dictionary - it will be built in the process. For detecting phrases (n-grams) there is a separate module: https://radimrehurek.com/gensim/models/phrases.html – m9_psy Jun 13 '16 at 00:39
  • Thanks @m9_psy, from the sample you referred me to, does it only work for bigrams? `bigram = Phrases(sentence_stream)`? – Lin Ma Jun 13 '16 at 22:29
  • No, it can handle phrases of any length, and the docs I referred to contain the exact recipe for this. – m9_psy Jun 14 '16 at 00:42
  • Thanks @m9_psy, vote up for your reply. I am referring to this page (https://radimrehurek.com/gensim/models/phrases.html), and it says the module is used to find "frequently co-occurring tokens". My need is to find TF/IDF values and identify n-grams with high TF/IDF values, and "frequently co-occurring tokens" does not seem to be TF/IDF. Wondering about your comments; please feel free to correct me. :) – Lin Ma Jun 14 '16 at 23:19
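
A minimal sketch of the gensim route discussed in these comments, assuming gensim is installed; the sample sentences, `min_count`, and `threshold` values are illustrative only. `Phrases` merges frequently co-occurring tokens, and `TfidfModel` then supplies the TF/IDF weights, with the dictionary built from the corpus itself:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.phrases import Phrases

# Illustrative, already-tokenized corpus (one list of tokens per document).
sentence_stream = [
    ["new", "york", "is", "a", "big", "city"],
    ["new", "york", "has", "a", "big", "harbor"],
    ["tokyo", "is", "a", "big", "city", "too"],
]

# Merge frequently co-occurring tokens into single tokens (e.g. "new_york").
# Applying Phrases again to an already-phrased stream extends bigrams to longer phrases.
bigram = Phrases(sentence_stream, min_count=1, threshold=1)
phrased = [bigram[sentence] for sentence in sentence_stream]

# The dictionary is built on the fly -- no predefined word list is needed.
dictionary = Dictionary(phrased)
bow_corpus = [dictionary.doc2bow(doc) for doc in phrased]

tfidf = TfidfModel(bow_corpus)
for doc in tfidf[bow_corpus]:
    print([(dictionary[term_id], round(weight, 3)) for term_id, weight in doc])
```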

1 Answer


scikit-learn solves this issue.

http://scikit-learn.org/stable/modules/feature_extraction.html
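
As a rough illustration of what the linked page covers (the documents below are made up; `ngram_range=(1, 4)` covers the 1- to 4-grams asked about in the comments):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The input is just the raw documents -- no dictionary is needed up front.
corpus = [
    u"this is the first document",
    u"this document is the second document",
    u"and this is the third one",
]

# ngram_range=(1, 4) extracts unigrams through 4-grams as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
tfidf_matrix = vectorizer.fit_transform(corpus)   # rows: documents, columns: n-grams

print(vectorizer.get_feature_names()[:10])        # the extracted n-grams
print(tfidf_matrix.shape)                         # (n_documents, n_ngrams)
```

(In newer scikit-learn releases, `get_feature_names` has been replaced by `get_feature_names_out`.)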

dmitryro
  • Thanks user3358074, vote up for your reply. To use the reference you pointed out from scikit-learn, do I need to have a dictionary in advance? My requirement is that I do not have a dictionary; I just need to calculate TF/IDF for all possible n-grams. Your advice is appreciated. :) – Lin Ma Jun 13 '16 at 00:34
  • This will probably just require your corpus, and then the rest is like here: http://stackoverflow.com/questions/23792781/tf-idf-feature-weights-using-sklearn-feature-extraction-text-tfidfvectorizer – dmitryro Jun 13 '16 at 00:47
  • Thanks dmitryro, vote up for your reply. Just to confirm my understanding is correct: (1) by `corpus`, do you mean the raw documents/files I have, for which I want to generate TF/IDF of n-grams, rather than a word dictionary file? (2) Do you know whether scikit-learn also works with Unicode text such as Chinese and Japanese characters (assuming Unicode-encoded input)? – Lin Ma Jun 13 '16 at 22:33
  • BTW, another question: (3) I do not see where to specify the n-gram parameter n; currently I want n to be no more than 4 (i.e., I want to consider 1-grams, 2-grams, 3-grams and 4-grams for their TF/IDF values). Thanks. – Lin Ma Jun 13 '16 at 22:35
  • Thanks dmitryro, I read the link you referred to (http://scikit-learn.org/stable/modules/feature_extraction.html), but I cannot find an example of where to input the corpus (I cannot find one in the other Stack Overflow post you referred to either). If you could provide more information, it would be great. – Lin Ma Jun 17 '16 at 18:36
  • Thanks for all the help, dmitryro; I'll mark your reply as the answer. :) – Lin Ma Jul 27 '16 at 23:19
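
A hedged follow-up to this thread: once the TF/IDF matrix is built as in the sketch under the answer, pulling out each document's highest-weighted n-grams is just a matter of sorting a row of the matrix. The documents and variable names below are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    u"this is the first document",
    u"this document is the second document",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 4))
tfidf_matrix = vectorizer.fit_transform(corpus)
terms = np.array(vectorizer.get_feature_names())

# For each document, list its top n-grams from highest to lowest TF/IDF weight.
for row in tfidf_matrix.toarray():
    order = np.argsort(row)[::-1]
    top = [(terms[i], round(float(row[i]), 3)) for i in order if row[i] > 0][:5]
    print(top)
```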