If you really are pulling in 4 million features to analyze maybe a dozen words, most of the features won't be used. This suggests keeping the features in some sort of disk-based database instead, and pulling in only the ones you need. Even for a long sentence and an inefficient database, 4 seeks x 50 words is still way less than what you see now -- maybe hundreds of milliseconds in the worst case, but certainly not multiple seconds.
Look at anydbm with an NDBM or GDBM back-end for a start, then maybe consider other back-ends depending on familiarity and availability.
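A minimal sketch of the idea with the standard-library dbm module (called anydbm in Python 2); the file name and scores here are made up for illustration:

```python
import dbm

# One-time setup: write the scores to disk. dbm stores only
# strings/bytes, so store scores as text and convert on the way out.
with dbm.open('features.db', 'c') as db:
    db['good'] = '1'
    db['bad'] = '-1'

# Later: read back only the keys you actually need.
with dbm.open('features.db', 'r') as db:
    print(int(db['good']))    # -> 1
    print('excellent' in db)  # -> False; an absent word costs one lookup
```

The point is that a lookup touches only a few disk pages, no matter how many features are in the database.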
Your follow-up comments seem to suggest a basic misunderstanding of what you are doing and/or how things are supposed to work. Let's make a simple example with five words in the lexicon.
# training
import pickle

# 'classifier' stands for whatever builds your model from the feature dict
d = {'good': 1, 'bad': -1, 'excellent': 1, 'poor': -1, 'great': 1}
c = classifier(d)
with open("classifier.pickle", "wb") as f:
    pickle.dump(c, f)

sentences = ['I took a good look', 'Even his bad examples were stunning']

# classifying, stupid version
for sentence in sentences:
    with open("classifier.pickle", "rb") as f:
        c = pickle.load(f)
    sentiment = c(sentence)
    # basically: for word in sentence.split(): if word in d: sentiment += d[word]
    print(sentiment, sentence)

# classifying, slightly less stupid version
with open("classifier.pickle", "rb") as f:
    c = pickle.load(f)
# FastCGI init_end here
for sentence in sentences:
    sentiment = c(sentence)
    print(sentiment, sentence)
The stupid version appears to be what you are currently experiencing. The slightly less stupid version loads the classifier once, and then runs it on each of the input sentences. This is what FastCGI will do for you: you can do the loading part in the process start-up once, and then have a service running which runs it on input sentences as they come in. This is resource-efficient but a bit of work, because converting your script to FastCGI and setting up the server infrastructure is a hassle. If you expect heavy use, it's definitely the way to go.
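The load-once pattern is the important part. Here is a self-contained sketch of it; the dict-based model and the handle_request name are made up for illustration (a FastCGI or WSGI server would call something like handle_request once per request, while the module-level loading runs once per worker process):

```python
import pickle

# One-time setup (training side): persist the word scores.
scores = {'good': 1, 'bad': -1}
with open("classifier.pickle", "wb") as f:
    pickle.dump(scores, f)

# Worker start-up: runs once per process, not once per request.
with open("classifier.pickle", "rb") as f:
    MODEL = pickle.load(f)

def handle_request(sentence):
    # Per-request work: just dictionary lookups, no disk I/O.
    return sum(MODEL.get(word, 0) for word in sentence.split())

print(handle_request('I took a good look'))  # -> 1
```

However you deploy it, the expensive load happens outside the per-request path.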
But observe that only two features out of the five in the model are actually ever needed. Most of the words in the sentences do not have a sentiment score, and most of the words in the sentiments database are not required to calculate a score for these inputs. So a database implementation would instead look something like this (rough pseudocode for the DBM part):
with opendbm("sentiments.db") as d:
    for sentence in sentences:
        sentiment = 0
        for word in sentence.split():
            try:
                sentiment += d[word]
            except KeyError:
                pass
        print(sentiment, sentence)
The cost per transaction is higher, so this is slower than the FastCGI version, which loads the whole model into memory only once, at start-up; but it does not require you to keep state or set up the FastCGI infrastructure, and it is far more efficient than the stupid version, which reloads the entire model for each sentence.
(In reality, for a web service without FastCGI, you would effectively have the opendbm inside the for loop instead of the other way around.)
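For reference, a runnable version of the pseudocode above, using the standard-library dbm module (anydbm in Python 2); the database name and scores are the same made-up example values as before:

```python
import dbm

# One-time setup: write the sentiment scores to disk. dbm values must
# be strings or bytes, so store the integers as text.
with dbm.open('sentiments.db', 'c') as db:
    for word, score in {'good': 1, 'bad': -1, 'excellent': 1,
                        'poor': -1, 'great': 1}.items():
        db[word] = str(score)

sentences = ['I took a good look', 'Even his bad examples were stunning']

# Per-run work: open the database once, fetch only the words that occur.
with dbm.open('sentiments.db', 'r') as db:
    for sentence in sentences:
        sentiment = 0
        for word in sentence.split():
            try:
                sentiment += int(db[word])
            except KeyError:
                pass
        print(sentiment, sentence)  # -> 1 and -1 for these two sentences
```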