
I am trying to process a large corpus of tweets (1,600,000 of them, which can be found here) with the following code, to train a Naive Bayes classifier in order to play around with sentiment analysis.

My problem is that I have never written anything that had to handle a lot of memory or large variables.

At the moment the script runs for a while and then, after a couple of hours, I get a runtime error (I'm on a Windows machine). I believe I'm not managing the list objects properly.

The program runs successfully when I limit the for loop, but that means limiting the training set and quite likely getting worse sentiment analysis results.

How can I process the whole corpus? How can I better manage those lists? Are they really the ones causing the problem?

These are the imports:

import pickle
import re
import os, errno
import csv
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier

Here I load the corpus and create the lists where I want to store the features I extract from it:

inpTweets = csv.reader(open('datasets/training.1600000.processed.noemoticon.csv', 'rb'), delimiter=',', quotechar='"')
tweets = []
featureList = []
n=0

This for loop walks through the corpus and, thanks to processTweet() (a long routine not shown here), extracts the features from each row of the .csv; a sketch of what such a function might look like follows the loop.

for row in inpTweets:
    sentiment = row[0]
    status_text = row[5]
    featureVector = processTweet(status_text.decode('utf-8')) 
    #to know it's doing something
    n = n + 1
    print n
    #we'll need both the featurelist and the tweets variable, carrying tweets and sentiments

Here I extend/append the variables to the lists; we're still inside the for loop.

    featureList.extend(featureVector)  
    tweets.append((featureVector, sentiment))              
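processTweet() itself is not included in the question. As a rough idea only, a minimal sketch of such a feature extractor could look like the following; the cleaning rules here are assumptions, not the actual implementation:

def processTweet(text):
    # hypothetical sketch: lowercase, strip URLs and @mentions, return a list of word features
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)   # drop URLs
    text = re.sub(r'@\w+', '', text)           # drop @mentions
    words = re.findall(r"[a-z']+", text)
    return [w for w in words if len(w) > 2]    # keep words longer than two characters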

When the loop ends, I get rid of duplicates in featureList and save it to a pickle.

featureList = list(set(featureList))
flist = open('fList.pickle', 'w')
pickle.dump(featureList, flist)
flist.close()
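To reuse the feature list in a later session, the pickle can be loaded back with a small sketch like this (text mode 'r' matches the text-mode 'w' used when saving):

# usage sketch: load the saved feature list back in a later session
flist = open('fList.pickle', 'r')
featureList = pickle.load(flist)
flist.close()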

I get the features ready for the classifier.

training_set = nltk.classify.util.apply_features(extract_features, tweets)
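extract_features() is also not defined in the question. With nltk.classify.util.apply_features it is typically a function that maps a tweet's word list to a feature dictionary; a common bag-of-words sketch (assuming featureList holds the vocabulary built above) would be:

def extract_features(tweet_words):
    # hypothetical sketch: one boolean 'contains(word)' feature per word in featureList
    words = set(tweet_words)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in words)
    return features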

Then I train the classifier and save it to a pickle.

# Train the Naive Bayes classifier
print "\nTraining the classifier.."
NBClassifier = nltk.NaiveBayesClassifier.train(training_set)
fnbc = open('nb_classifier.pickle', 'w')
pickle.dump(NBClassifier, fnbc)
fnbc.close()
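Once trained, the classifier can be applied to new text; a small usage sketch (the example tweet is made up, processTweet/extract_features as sketched above):

# usage sketch: classify a new, made-up tweet with the trained model
test_words = processTweet(u"I love this phone, it works great!")
print NBClassifier.classify(extract_features(test_words))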

edit (19:45 GMT+1): I forgot to add n=0 in this post.

edit1: Due to lack of time and computing power limitations, I chose to reduce the corpus like this:

.....
n=0
i=0
for row in inpTweets:
    i = i+1
    if (i==160):         #limiter
        i = 0
        sentiment = row[0]
        status_text = row[5]  
        n = n + 1
.....

I did this because, in the end, the classifier was taking ages to train. About the runtime error, please see the comments. Thanks everyone for the help.

  • What runtime error do you get? And where? – AkiRoss Feb 01 '15 at 18:06
  • I did not take note of it, sorry. At the moment I'm running the program again in order to reproduce it and give you the exact error. Thanks for the help, though. The error happens at some point during the for loop, after many, many iterations have been executed successfully. – albertogloder Feb 01 '15 at 18:43
  • Problems can be anywhere. Could be insufficient memory, could be a bug due to an error in the dataset (and may be located near the end of the dataset, for this reason you won't see it by limiting the loop), etc. To avoid memory issues, try to unload the memory from time to time, e.g. write to file every 10K rows. Also, try to do your processing of the dataset by chunk: divide it in segments and process one at time, saving partial results. – AkiRoss Feb 01 '15 at 18:56
  • use `pandas` for bigger-than-example data – alvas Feb 02 '15 at 07:35
  • I found what was giving the runtime error after many loops had already executed successfully. In the version before the one I posted, instead of counting every iteration I was printing every featureVector (that was the only difference from this code). It was very fancy to watch, but in my opinion that was what was causing the runtime errors, probably because the Python shell was holding way too many items. After I switched to the version I posted, I could run the program successfully. My apologies, as I only recalled the change many hours after posting. – albertogloder Feb 03 '15 at 23:46
  • Still, my computer could not train the classifier (1 day and it was still running; I killed it). As the classifier was taking ages to train, I simply limited the loop like this: `n = 0 i = 0 for row in inpTweets: i = i+1 if (i==1590): #limiter n = n + 1` – albertogloder Feb 03 '15 at 23:49

1 Answer


You could use csv.field_size_limit(int)

For example:

f = open('datasets/training.1600000.processed.noemoticon.csv', 'rb')
csv.field_size_limit(100000)
inpTweets = csv.reader(f, delimiter=',', quotechar='"')

You can try changing the value 100,000 to something that suits your data better.

+1 on the comment about Pandas.
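If you go the Pandas route, reading the CSV in chunks keeps you from holding the whole file in memory at once. A sketch (the column positions and latin-1 encoding are assumptions about this particular dataset):

import pandas as pd

# sketch: iterate the CSV in chunks of 10,000 rows instead of loading it all at once
reader = pd.read_csv('datasets/training.1600000.processed.noemoticon.csv',
                     header=None, encoding='latin-1', chunksize=10000)
for chunk in reader:
    for sentiment, status_text in zip(chunk[0], chunk[5]):
        # process one tweet at a time here instead of building giant lists
        pass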

Also, you might want to check out cPickle here. (1000x faster)
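For example, a drop-in swap on Python 2 (using binary mode and an explicit protocol here is an extra suggestion, not something from the original code):

import cPickle as pickle   # same interface as pickle, implemented in C (Python 2)

fnbc = open('nb_classifier.pickle', 'wb')
pickle.dump(NBClassifier, fnbc, 2)   # binary protocol 2 is smaller and faster
fnbc.close()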


Check out this question / answer too!

Another relevant blog post here.

– metasyn
  • Thank you! I didn't know I could limit csv like that. That will surely come in handy. However, my dataset was split in half between positive and negative tweets: first the negative ones, then the positive. So I put a condition in my for loop to only execute the code every N entries. – albertogloder Feb 03 '15 at 23:52