Does NLTK have TF-IDF implemented?

Question

There are TF-IDF implementations in scikit-learn and gensim.

There are also simple standalone implementations, e.g. "Simple implementation of N-Gram, tf-idf and Cosine similarity in Python".

To avoid reinventing the wheel:

Is there really no TF-IDF in NLTK?
Are there sub-packages that we can use to implement TF-IDF in NLTK? If so, how?

This blog post says NLTK doesn't have it. Is that true? http://www.bogotobogo.com/python/NLTK/tf_idf_with_scikit-learn_NLTK.php

Hm, I didn't try tf_idf. Moreover, google can't find tf_idf in the name of function. Double fail) — Nikita Astrakhantsev, Apr 10 '15 at 22:39

score 10 · Accepted Answer · answered Apr 10 '15 at 20:51

The NLTK TextCollection class has a method for computing the tf-idf of terms. The documentation is here, and the source is here. However, the documentation notes that it "may be slow to load", so using scikit-learn may be preferable.
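For reference, a minimal sketch of calling it, assuming NLTK is installed and that `TextCollection` is given a list of already-tokenized texts. The toy documents and variable names are made up for illustration, and the `expected` value assumes the plain tf * log(N / df) definition used in the NLTK source:

```python
import math

from nltk.text import TextCollection

# Toy corpus: each document is a list of tokens (illustrative data only).
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["a", "cat", "purred"],
]

collection = TextCollection(documents)

# tf_idf(term, text) scores a term within one text of the collection.
score = collection.tf_idf("dog", documents[1])

# Hand computation under the assumed definition:
# tf = 1 occurrence / 3 tokens, idf = log(3 texts / 1 text containing "dog").
expected = (1 / 3) * math.log(3 / 1)

print(score, expected)
```

No corpus download should be needed for this, since the texts are plain Python lists.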

yvespeirsman

From https://github.com/nltk/nltk/blob/develop/nltk/text.py#L566, this looks expensive: `len([True for text in self._texts if term in text])` – alvas Apr 10 '15 at 21:56
At least, now we find a spot that we should optimize. If that loop becomes a real cheap operation, we might get some hope =) – alvas Apr 10 '15 at 21:59
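One way that loop could be made cheap is to count document frequencies once up front instead of rescanning every text per term. This is only a sketch in plain Python, not NLTK's code; `build_document_frequency` and `tf_idf` are hypothetical helper names, and the formula assumed is tf * log(N / df):

```python
import math
from collections import Counter


def build_document_frequency(texts):
    """Count, for each term, how many texts contain it (one pass per text)."""
    df = Counter()
    for text in texts:
        # set() so a term repeated inside one text is counted only once
        df.update(set(text))
    return df


def tf_idf(term, text, df, n_texts):
    """tf-idf of `term` in `text`, using the precomputed document frequencies."""
    if not text or df[term] == 0:
        return 0.0
    tf = text.count(term) / len(text)
    idf = math.log(n_texts / df[term])
    return tf * idf


# Illustrative toy data
texts = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["a", "cat", "purred"],
]
df = build_document_frequency(texts)
dog_score = tf_idf("dog", texts[1], df, len(texts))
print(df["cat"], dog_score)
```

After the one-off pass, each document-frequency lookup is a dictionary access rather than a scan over the whole collection.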

score 4 · Answer 2 · edited May 23 '17 at 12:09

I guess there is enough evidence to conclude that TF-IDF does not exist in NLTK:

Unfortunately, calculating tf-idf is not available in NLTK so we'll use another data analysis library, scikit-learn

from COMPSCI 290-01 Spring 2014 lab
More importantly, the source code contains nothing related to tfidf (or tf-idf). The exception is NLTK-contrib, which contains a map-reduce implementation of TF-IDF.

Several libraries for tf-idf are mentioned in a related question.

Upd: searching for "tf idf" or "tf_idf" does find the function already found by @yvespeirsman.
