I'm trying to build a text classification model with Python and TextBlob. The script runs on my server, and the idea is that in the future users will be able to submit their own text and have it classified. I'm loading the training set from a CSV file:

# -*- coding: utf-8 -*-
import sys
from textblob.classifiers import NaiveBayesClassifier

# Redirect everything printed below into a text file.
sys.stdout = open('yyyyyyyyy.txt', 'w')

# Train the classifier on the CSV file (one "text",label pair per line).
with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

print(cl.classify("some text"))

The CSV is about 500 lines long (each string is between 10 and 100 characters), and NaiveBayesClassifier needs about 2 minutes to train before it can classify my text. I'm not sure whether it is normal for it to take this long; maybe my server, with only 512 MB of RAM, is just slow.

Example of a CSV line:

"Oggi alla Camera con la Fondazione Italia-Usa abbiamo consegnato a 140 studenti laureati con 110 e 110 lode i diplomi del Master in Marketing Comunicazione e Made in Italy.",FI-PDL
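
To make the layout concrete, here is a small sketch of how a line like this splits into a text and a label, using only Python's standard csv module. It only illustrates the file format and is not TextBlob's own loading code; the variable names are made up for the example.

```python
import csv
import io

# One line in the same layout as the training file: "text",label
sample = (
    '"Oggi alla Camera con la Fondazione Italia-Usa abbiamo consegnato '
    'a 140 studenti laureati con 110 e 110 lode i diplomi del Master in '
    'Marketing Comunicazione e Made in Italy.",FI-PDL\n'
)

# csv.reader strips the surrounding quotes and splits on the comma
# that sits outside them, giving two fields per row.
rows = list(csv.reader(io.StringIO(sample)))
text, label = rows[0]

print(label)      # the class the classifier should learn for this text
print(len(text))  # length of the training string in characters
```

The first field is the text to learn from and the second is its class, which is the pairing the classifier is trained on.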

What is not clear to me, and what I can't find an answer to in the TextBlob documentation, is whether there is a way to save my trained classifier (and so save a lot of time), because right now the classifier is trained again every time I run the script. I'm new to text classification and machine learning, so my apologies if this is a dumb question.

Thanks in advance.

Nico
  • Take a look at http://stackoverflow.com/questions/21107075/classification-using-movie-review-corpus-in-nltk-python/21126594#21126594 and https://github.com/alvations/bayesline – alvas Nov 24 '15 at 12:39
  • Thanks, but I can't see how those links help me :) – Nico Nov 27 '15 at 14:09
  • They are examples of text classification. – alvas Nov 27 '15 at 14:38
  • Yes, but my problem is not how to do text classification. It is how the TextBlob library can save a classifier or, if I use pickle, how I can speed up classification after loading it :) In short, I can classify text pretty well, but it takes a fairly long time. – Nico Nov 27 '15 at 14:47
  • ;P That's why there's a real working version of the code in: https://github.com/alvations/bayesline/blob/master/bayesline/discriminate.py#L80 – alvas Nov 27 '15 at 15:11
  • Sorry, but I still don't understand how this is helpful in my case. Can you explain further? – Nico Nov 27 '15 at 15:56
  • In my answer below I'm pickling the classifier as in your link, but once it is loaded it needs about 1 minute to classify a sentence of 100 characters. Is that normal? – Nico Nov 27 '15 at 16:10

1 Answer

OK, I found that the pickle module is what I need :)

Training:

# -*- coding: utf-8 -*-
import pickle
from textblob.classifiers import NaiveBayesClassifier

# Train once on the CSV file.
with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

# Save the trained classifier to disk so later runs can skip training.
with open('classifier.pickle', 'wb') as f:
    pickle.dump(cl, f)

Loading:

import pickle
import sys
from textblob.classifiers import NaiveBayesClassifier

# Redirect printed output to a text file.
sys.stdout = open('demo.txt', 'w')

# Load the previously trained classifier instead of retraining it.
with open('classifier.pickle', 'rb') as f:
    cl = pickle.load(f)

print(cl.classify("text to classify"))
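
The two scripts can also be folded into one "train only if no saved copy exists" step. The following is a minimal sketch under that assumption: `load_or_train` and `train_textblob` are hypothetical helper names made up here, not part of TextBlob, and the file names are the same placeholders used above.

```python
import os
import pickle


def load_or_train(cache_path, train_fn):
    """Return the pickled object at cache_path, or train and cache it."""
    if os.path.exists(cache_path):
        # A saved classifier exists: load it and skip training.
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    # No saved copy yet: train once, then write the result to disk.
    model = train_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(model, fh)
    return model


def train_textblob(csv_path="file.csv"):
    """Hypothetical training function; needs textblob and the CSV file."""
    # Imported here so load_or_train itself has no textblob dependency.
    from textblob.classifiers import NaiveBayesClassifier
    with open(csv_path, "r", encoding="latin-1") as fp:
        return NaiveBayesClassifier(fp, format="csv")


# Usage (assumes file.csv exists and textblob is installed):
# cl = load_or_train("classifier.pickle", train_textblob)
# print(cl.classify("text to classify"))
```

Only the first run pays the training cost; later runs read classifier.pickle. Since unpickling can execute arbitrary code, load only pickle files you created yourself.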
Nico
  • Hi, can you please give me an idea of how to update the current model with another CSV file? – user1632980 Jul 18 '16 at 12:37
  • Hi, sorry, but in the end I stopped using this library and moved to scikit-learn, a much stronger and more flexible library for my needs. – Nico Jul 18 '16 at 12:43
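
On the question of updating a saved model: TextBlob classifiers have an `update(new_data)` method that takes a list of (text, label) tuples and, as far as I know, retrains on the old and new data together. A rough sketch of a load, update, save round trip follows. `update_saved_classifier` is a hypothetical helper name, and the only thing it assumes about the pickled object is that it has an `update` method.

```python
import pickle


def update_saved_classifier(pickle_path, new_rows):
    """Load a pickled classifier, add (text, label) rows, save it back."""
    with open(pickle_path, "rb") as fh:
        cl = pickle.load(fh)

    # For a TextBlob classifier this adds the rows to the training set
    # and retrains, so it can take a while on a slow machine.
    cl.update(new_rows)

    # Overwrite the old pickle with the updated classifier.
    with open(pickle_path, "wb") as fh:
        pickle.dump(cl, fh)
    return cl


# Usage (assumes classifier.pickle was written by the training script
# and that the rows of the new CSV were already read into tuples):
# new_rows = [("some new text", "FI-PDL")]
# cl = update_saved_classifier("classifier.pickle", new_rows)
```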