The following code is supposed to create a new (modified) version of a frequency distribution (nltk.FreqDist). The original (freq_dist) and the new dictionary (freq_dist_weighted) should then have the same length.
It works fine when a single instance of WebText is created. But when multiple WebText instances are created, then the new variable seems to be shared by all the objects.
For example:
import nltk
from operator import itemgetter

class WebText:
    # Dictionary defined once, in the class body
    freq_dist_weighted = {}

    def __init__(self, text):
        tokens = nltk.wordpunct_tokenize(text)  # tokenize
        word_count = len(tokens)
        freq_dist = nltk.FreqDist(tokens)
        for word, frequency in freq_dist.iteritems():
            # float() avoids integer division under Python 2
            self.freq_dist_weighted[word] = float(frequency) / word_count * frequency
        print len(freq_dist), len(self.freq_dist_weighted)

text1 = WebText("this is a test")
text2 = WebText("this is another test")
text3 = WebText("a final sentence")
This results in:
4 4
4 5
3 7
This is incorrect. Since I am just copying the entries and modifying their values, both columns should show the same number on every line. If I reset freq_dist_weighted just before the loop, it works fine:
import nltk
from operator import itemgetter

class WebText:
    # Dictionary defined once, in the class body
    freq_dist_weighted = {}

    def __init__(self, text):
        tokens = nltk.wordpunct_tokenize(text)  # tokenize
        word_count = len(tokens)
        freq_dist = nltk.FreqDist(tokens)
        self.freq_dist_weighted = {}  # reset just before the loop
        for word, frequency in freq_dist.iteritems():
            # float() avoids integer division under Python 2
            self.freq_dist_weighted[word] = float(frequency) / word_count * frequency
        print len(freq_dist), len(self.freq_dist_weighted)

text1 = WebText("this is a test")
text2 = WebText("this is another test")
text3 = WebText("a final sentence")
This results in the correct output:
4 4
4 4
3 3
This doesn't make sense to me.
I don't see why I would have to reset it, since it's isolated within the objects. Am I doing something wrong?
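The same behaviour seems to be reproducible without nltk. Below is a minimal sketch, assuming two hypothetical classes, SharedCounts and OwnCounts (neither is part of my code), with str.split standing in for the tokenizer. It suggests that a dictionary defined in the class body is one object reachable from every instance, whereas a dictionary assigned to self inside __init__ belongs to that instance only:

```python
class SharedCounts(object):
    # Hypothetical stand-in for WebText: this dict is created once,
    # when the class body is executed.
    counts = {}

    def __init__(self, text):
        for word in text.split():
            # No instance attribute named "counts" exists, so the lookup
            # falls back to the class and mutates the class-level dict.
            self.counts[word] = self.counts.get(word, 0) + 1


class OwnCounts(object):
    def __init__(self, text):
        # Assigning to self.counts creates a new dict on this instance.
        self.counts = {}
        for word in text.split():
            self.counts[word] = self.counts.get(word, 0) + 1


a = SharedCounts("this is a test")
b = SharedCounts("this is another test")
print(a.counts is b.counts)  # True: both names refer to the same dict
print(len(b.counts))         # 5: words from both texts accumulate

c = OwnCounts("this is a test")
d = OwnCounts("this is another test")
print(c.counts is d.counts)  # False: each instance has its own dict
print(len(d.counts))         # 4
```

If that reading is right, the "reset" in my second version is not really a reset: it creates an instance attribute that hides the class-level one.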