
I'm trying to use NLTK to compute term frequency (TF) and inverse document frequency (IDF) for a batch of files (they happen to be corporate press releases from IBM). I know that whether NLTK has TF-IDF capabilities has been disputed on SO before, but I've found docs indicating that the module does have them:

http://www.nltk.org/_modules/nltk/text.html

http://www.nltk.org/api/nltk.html#nltk.text.TextCollection

I've never seen or used "self" or "__init__" to execute code before. This is what I have so far. Any advice on how to amend this code so it works is very much appreciated. What I currently have doesn't return anything. I don't really understand what "source", "self", "term" and "text" in the NLTK docs represent.

import nltk.corpus
from nltk.text import TextCollection
from nltk.corpus import gutenberg
gutenberg.fileids()

ibm1 = gutenberg.words('ibm-github.txt')
ibm2 = gutenberg.words('ibm-alior.txt')

mytexts = TextCollection([ibm1, ibm2])
term = 'software'

def __init__(self, source):
    if hasattr(source, 'words'):
        source = [source.words(f) for f in source.fileids()]

    self._texts = source
    Text.__init__(self, LazyConcatenation(source))
    self._idf_cache = {}

def tf(self, term, mytexts):
    result = mytexts.count(term) / len(mytexts)
    print(result)

1 Answer

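from nltk.text import TextCollection
# nltk.book needs the sample corpora; run nltk.download('book') once beforehand
from nltk.book import text1, text2, text3

# Build a collection from several texts; IDF is computed across all of them
mytexts = TextCollection([text1, text2, text3])

# Print the IDF of a word (matching is case-sensitive)
print(mytexts.idf("Moby"))

# Print the TF-IDF of the word within one text of the collection
print(mytexts.tf_idf("Moby", text1))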
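On the "self", "source", "term" and "text" part of the question: the `__init__` and `tf` definitions pasted from the NLTK source are methods of `TextCollection`, so they are not meant to be copied into your own script. `self` is the collection object itself, `source` is the list of texts you pass when creating it, `term` is the word you are looking up, and `text` is the single document you are scoring. Your script printed nothing because it only defined `tf` and never called it. As far as I can tell from the source linked in the question, the arithmetic behind those methods looks roughly like the sketch below. It is plain Python, not NLTK itself; the helper functions and the two token lists are made up for illustration and only stand in for your tokenized press releases.

```python
from math import log

# Hypothetical token lists standing in for two tokenized press releases.
doc1 = ["ibm", "releases", "software", "for", "open", "source", "software"]
doc2 = ["ibm", "partners", "with", "a", "bank"]
docs = [doc1, doc2]


def tf(term, text):
    """Fraction of the tokens in `text` that are equal to `term`."""
    return text.count(term) / len(text)


def idf(term, texts):
    """log(number of texts / number of texts containing `term`).

    Returns 0.0 when no text contains the term, to avoid dividing by zero.
    """
    matches = sum(1 for text in texts if term in text)
    return log(len(texts) / matches) if matches else 0.0


def tf_idf(term, text, texts):
    """Product of the term frequency in one text and the collection-wide IDF."""
    return tf(term, text) * idf(term, texts)


print(tf("software", doc1))            # 2 of the 7 tokens in doc1
print(idf("software", docs))           # appears in 1 of 2 texts
print(idf("ibm", docs))                # appears in every text
print(tf_idf("software", doc1, docs))
```

A word that occurs in every document, like "ibm" here, gets an IDF of zero and therefore a TF-IDF of zero, which is the point of the weighting. For your own files you would build the token lists from the press releases rather than from `gutenberg`, which only knows the file ids that ship with that corpus.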