
I'm making a simple sentiment mining system using a Naive Bayes classifier.

For training my classifier, I have a text file where each line contains a list of tokens (generated from a tweet) and the associated sentiment (0 for negative, 4 for positive).

For example:

0 @ switchfoot http : //twitpic.com/2y1zl - Awww , that 's a bummer . You shoulda got David Carr of Third Day to do it . ; D
0 spring break in plain city ... it 's snowing
0 @ alydesigns i was out most of the day so did n't get much done
0 some1 hacked my account on aim now i have to make a new one
0 really do n't feel like getting up today ... but got to study to for tomorrows practical exam ...

Now, what I'm trying to do is, for each token, count how many times it occurs in a positive tweet and how many times it occurs in a negative tweet. I then plan to use these counts to calculate probabilities. I'm using the built-in dictionary for storing these counts. The keys are the tokens and the values are integer arrays of size 2.

The problem is that this code starts off pretty fast but keeps getting slower, and by the time it has processed around 200 thousand tweets it crawls along at about 1 tweet per second. Since my training set has 1.6 million tweets, this is too slow. The code I have is this:

def compute_counts(infile):
    f = open(infile)
    counts = {}
    i = 0
    for line in f:
        i = i + 1
        print(i)
        words = line.split(' ')
        for word in words[1:]:
            word = word.replace('\n', '').replace('\r', '')
            if words[0] == '0':
                if word in counts.keys():
                    counts[word][0] += 1
                else:
                    counts[word] = [1, 0]
            else:
                if word in counts.keys():
                    counts[word][1] += 1
                else:
                    counts[word] = [0, 1]
    return counts

What can I do to make this process faster? A better data structure?

Edit: Not a duplicate, the question is not about something faster than dict in the general case, but in this specific use case.

hoodakaushal
  • Instead of checking whether the key exists, use `counts.get(key, default)`; you can provide a default value that is returned when the key is not present. – Hemant Gangwar Sep 14 '14 at 13:37
  • You could use two `collections.Counter` instead of one dictionary of lists. – Yann Vernier Sep 14 '14 at 13:38

1 Answer


Don't use `if word in counts.keys()`. In Python 2 that builds a list of all the keys and searches it sequentially, which is exactly the linear scan a dict is designed to avoid.

Just write `if word in counts`, which uses the dict's hash-based lookup directly.

Or use a `defaultdict`: https://docs.python.org/2/library/collections.html#collections.defaultdict
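For example, the counting loop from the question could be sketched with a `defaultdict` like this (variable names follow the question's code; using `line.split()` instead of `split(' ')` is an assumption on my part, since it also strips the trailing newline and collapses repeated spaces):

```python
from collections import defaultdict

def compute_counts(infile):
    # counts[word] is [negative_count, positive_count]
    counts = defaultdict(lambda: [0, 0])
    with open(infile) as f:
        for line in f:
            words = line.split()  # also strips '\n'/'\r'
            if not words:
                continue
            # first token is the label: '0' = negative, '4' = positive
            idx = 0 if words[0] == '0' else 1
            for word in words[1:]:
                counts[word][idx] += 1
    return counts
```

A missing key is created on first access with `[0, 0]`, so there is no membership test at all, and every update is a single hashed lookup.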

khelwood
  • In Python 2, `dict.keys()` creates a list, an operation which is probably as expensive as the search. It's not the dictionary being slow. – Yann Vernier Sep 14 '14 at 13:35
  • defaultdict worked like a charm. Earlier it took ~4 hours to process 200k lines, but now the whole 1.6 million are done within a minute. Thanks! – hoodakaushal Sep 14 '14 at 13:38