I'm trying to use NLTK to run term frequency (TF) and inverse document frequency (IDF) analyses on a batch of files (they happen to be corporate press releases from IBM). I know that whether NLTK has TF-IDF capabilities has been disputed on SO before, but I've found docs indicating the module does have them:
http://www.nltk.org/_modules/nltk/text.html
http://www.nltk.org/api/nltk.html#nltk.text.TextCollection
I've never seen or used "self" or "__init__" to execute code before. This is what I have so far. What I currently have doesn't return anything, and I don't really understand what "source," "self," "term," and "text" represent in the NLTK docs. Any advice on how to amend this code so it works would be much appreciated.
import nltk.corpus
from nltk.text import TextCollection
from nltk.corpus import gutenberg
gutenberg.fileids()
ibm1 = gutenberg.words('ibm-github.txt')
ibm2 = gutenberg.words('ibm-alior.txt')
mytexts = TextCollection([ibm1, ibm2])
term = 'software'
def __init__(self, source):
    if hasattr(source, 'words'):
        source = [source.words(f) for f in source.fileids()]
    self._texts = source
    Text.__init__(self, LazyConcatenation(source))
    self._idf_cache = {}

def tf(self, term, mytexts):
    result = mytexts.count(term) / len(mytexts)
    print(result)
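For what it's worth, from the linked docs it looks like tf, idf, and tf_idf are meant to be called as methods on a TextCollection instance, not copied out as standalone defs, so the __init__ shouldn't need to be reproduced at all. Here's a minimal sketch of that reading; the token lists are made-up stand-ins for my press-release files, which I assume would first need to be tokenized (e.g. with nltk.word_tokenize):

```python
from nltk.text import TextCollection

# Made-up token lists standing in for the IBM press releases;
# real files would be read and tokenized first.
ibm1 = "ibm announces new software for github integration".split()
ibm2 = "ibm and alior bank sign software agreement".split()

mytexts = TextCollection([ibm1, ibm2])

term = 'software'
# tf(term, text): frequency of the term within one document
print(mytexts.tf(term, ibm1))
# idf(term): log(number of docs / docs containing the term);
# zero here, since 'software' occurs in both documents
print(mytexts.idf(term))
# tf_idf(term, text) combines the two
print(mytexts.tf_idf(term, ibm1))
```

Is this roughly how "self," "term," and "text" are supposed to map onto a call site, with "self" just being the TextCollection instance?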