[Project stack: Java, OpenNLP, Elasticsearch (datastore), Twitter4J to read data from Twitter]
I intend to use a maxent classifier to classify tweets. I understand that the initial step is to train the model, and from the documentation I found that there is a GISTrainer-based train method for this. I have managed to put together a simple piece of code that uses OpenNLP's maxent classifier to train the model and predict the outcome.
I have used two files, positive.txt and negative.txt, to train the model.
Contents of positive.txt
positive This is good
positive This is the best
positive This is fantastic
positive This is super
positive This is fine
positive This is nice
Contents of negative.txt
negative This is bad
negative This is ugly
negative This is the worst
negative This is worse
negative This sucks
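For what it's worth, OpenNLP's stock DocumentSampleStream appears to expect exactly this one-sample-per-line layout: the first whitespace-delimited token is the category label and the remainder is the text. A minimal sketch of reading positive.txt that way (assuming OpenNLP 1.5.x; the CategoryDataStream used below does roughly the same job over multiple files):

import java.io.FileInputStream;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class SampleStreamSketch {
    public static void main(String[] args) throws Exception {
        // Each line: "<category> <text>", e.g. "positive This is good"
        ObjectStream<String> lines =
            new PlainTextByLineStream(new FileInputStream("positive.txt"), "UTF-8");
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);
        DocumentSample sample;
        while ((sample = samples.read()) != null) {
            System.out.println(sample.getCategory() + " -> "
                + String.join(" ", sample.getText()));
        }
        samples.close();
    }
}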
And the Java methods below generate the outcome:
@Override
public void trainDataset(String source, String destination) throws Exception {
    // Builds the list of training files (positive.txt and negative.txt)
    File[] inputFiles = FileUtil.buildFileList(new File(source));
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;       // minimum number of times a feature must occur
    int iterations = 100; // training iterations
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff, iterations, bowfg);
    model.serialize(new FileOutputStream(modelFile));
}
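A note on the two constants above: cutoff is the minimum number of times a feature must occur to be retained, so with only eleven training sentences a cutoff of 5 will discard most word features. In OpenNLP 1.6+ the same GIS training is expressed through TrainingParameters instead of the cutoff/iterations overload; a minimal sketch of the equivalent call (an assumption on my part that the newer API is available, where bag-of-words is the default feature generator of DoccatFactory):

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.ml.maxent.GISTrainer;
import opennlp.tools.util.TrainingParameters;

// Replaces the cutoff/iterations overload inside trainDataset:
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ALGORITHM_PARAM, GISTrainer.MAXENT_VALUE); // GIS maxent
params.put(TrainingParameters.ITERATIONS_PARAM, "100");
params.put(TrainingParameters.CUTOFF_PARAM, "5"); // min feature frequency
DoccatModel model = DocumentCategorizerME.train("en", ds, params, new DoccatFactory());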
@Override
public void predict(String text, String modelFile) {
    InputStream modelStream = null;
    try {
        // Tokenize the input the same way the training data was tokenized
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        modelStream = new FileInputStream(modelFile);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        // One probability per category, in the model's category order
        double[] probs = categorizer.categorize(tokens);
        if (null != probs && probs.length > 0) {
            for (int i = 0; i < probs.length; i++) {
                System.out.println("double[] probs index " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (null != modelStream) {
            try {
                modelStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
public static void main(String[] args) {
    try {
        String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
        String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("This is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
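Since the stack uses Twitter4J to pull tweets, the plan is to feed each tweet's text into predict(). A minimal sketch of that wiring (assuming Twitter4J 3.x or later, with OAuth credentials configured in twitter4j.properties; the search query is just a placeholder):

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class TweetClassifierSketch {
    public static void main(String[] args) throws Exception {
        // Assumes OAuth credentials are set up in twitter4j.properties
        Twitter twitter = TwitterFactory.getSingleton();
        QueryResult result = twitter.search(new Query("some brand")); // placeholder query
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        String modelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
        for (Status status : result.getTweets()) {
            me.predict(status.getText(), modelPath); // classify each tweet's text
        }
    }
}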
I have the following questions:

1) How do I train a model iteratively, and how do I add new sentences/words to an existing model? Is there a specific format for the data file? I found that each line needs a minimum of two words, the label and the text, separated by a tab. Is my understanding valid?

2) Are there any publicly available data sets that I can use to train the model? I found some sources for movie reviews, but the project I'm working on involves not just movie reviews but also other things such as product reviews, brand sentiment, etc.

3) This helps to an extent. Is there a working example publicly available anywhere? I couldn't find the documentation for maxent.
Please help me out. I am kind of blocked on this.