Creating training data for a Maxent classfier in Java

Question

I am trying to create the java implementation for maxent classifier. I need to classify the sentences into n different classes.

I had a look at ColumnDataClassifier in stanford maxent classifier. But I am not able to understand how to create training data. I need training data in the form where training data includes POS Tags for words for sentence, so that the features used for classifier will be like previous word, next word etc.

I am looking for training data which has sentences with POS TAGGING and sentence class mentioned. example :

My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS

Any help will be appreciated.

I am looking for training data which has sentences with POS TAGGING and sentence class mentioned. example : My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS — Ankit Bansal, Mar 16 '16 at 12:05
Are these Q&A useful? http://stackoverflow.com/questions/28601653/how-do-we-get-run-stanford-classifier-on-an-array-of-strings and http://stackoverflow.com/questions/31091082/how-to-use-pos-tag-as-a-feature-for-training-data-by-naive-bayes-classifier — vcp, Mar 16 '16 at 12:46
the second link though seems good. but still i am not geting how to create the training set and how to then use maxent classifier to actually use that training data. — Ankit Bansal, Mar 17 '16 at 04:35

score 2 · Accepted Answer · edited May 23 '17 at 12:22

If I understand it correctly, you are trying to treat sentences as a set of POS tags.

In your example, the sentence "My name is XYZ" would be represented as a set of (PRP$, NN, VBZ, NNP). That would mean, every sentence is actually a binary vector of length 37 (because there are 36 possible POS tags according to this page + the CLASS outcome feature for the whole sentence)

This can be encoded for OpenNLP Maxent as follows:

PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1

or simply:

PRP$ NN VBZ NNP CLASS=SomeClassOfYours1

(For working code-snippet see my answer here: Training models using openNLP maxent)

Some more sample data would be:

"By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
"In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
"As soon as she moved out, the mobile home was demolished, the suit said."
...

This would yield samples:

IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...

However, I don't expect that such a classification yields good results. It would be better to make use of other structural features of a sentence, such as the parse tree or dependency tree that can be obtained using e.g. Stanford parser.

Edited on 28.3.2016: You can also use the whole sentence as a training sample. However, be aware that: - two sentences might contain same words but have different meaning - there is a pretty high chance of overfitting - you should use short sentences - you need a huge training set

According to your example, I would encode the training samples as follows:

class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...

Notice that the outcome variable comes as the first element on each line.

Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.

package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {


    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer( new FileEventStream(trainingFileName)); 
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer = new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);

        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);
        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }

}

And some dummy training data (stored as training-file.txt):

class=Male      My_PRP name_NN is_VBZ John_NNP
class=Male      My_PRP name_NN is_VBZ Peter_NNP
class=Female    My_PRP name_NN is_VBZ Anna_NNP
class=Female    My_PRP name_NN is_VBZ Gaby_NNP

This yields the following output:

Indexing events using cutoff of 0
Computing event counts...  done. 4 events
Indexing...  done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 4
        Number of Outcomes: 2
      Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-2.772588722239781  0.5
  2:  ... loglikelihood=-2.4410105407571203 1.0
      ...
 99:  ... loglikelihood=-0.16111520541752372    1.0
100:  ... loglikelihood=-0.15953272940719138    1.0
=======================================
class=Female
=======================================

Thanks for the reply. But I am looking for training set like this My_PRP name_NN is_VBZ XYZ_NNP CLASS Meaning the sentence are also present in the training set along with the POS Tags. And then use maxent as the classifier which uses features which includes features based on the sentences words and as well as POS tags, — Ankit Bansal, Mar 28 '16 at 06:28
I've just edited my answer - added a fully working example in java + sample data. — Viliam Simko, Mar 28 '16 at 21:53
Thanks a lot. Can you tell me one more thing. What are all the features that the classifier is using now. ? Also Is it using POS tags as featues for classification or it is considering POS tags as just an other English Words. — Ankit Bansal, Apr 07 '16 at 04:41
The classifier sees just some tokens (e.g. "My_PRP" as a token) and doesn't care about their meaning. I think it internally considers every token as an index in a hashtable that maps `string` to `bool`. — Viliam Simko, Apr 07 '16 at 14:38
And what about the features that the classifier is using? and can they be customized? — Ankit Bansal, Apr 08 '16 at 10:42
Could you be more specific ? You can use tokens like "a=1" or "something". — Viliam Simko, Apr 08 '16 at 15:12
By features, I mean what are the features that the classifier is using, like useNgram, use previous words, next words combination. etc.. — Ankit Bansal, Apr 11 '16 at 07:36
can the stored model be called from within R via the openNLP package - more specifically passing a model into the model parameter `Maxent_POS_Tag_Annotator(language = "en", probs = FALSE, model = NULL)` — brucezepplin, May 18 '16 at 14:48
My example shows how to build a general-purpose MAXENT classification model. Not just for POS tagging. I think that Maxent_POS_Tag_Annotator is a special implementation that uses opennlp MAXENT in the background. The question is, how it generates/uses features that are used by the underlying MAXENT model. — Viliam Simko, May 19 '16 at 13:17

Creating training data for a Maxent classfier in Java

1 Answers1

Linked