If I understand correctly, you are trying to treat each sentence as a set of POS tags.
In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP).
That would mean every sentence is effectively a binary vector of length 37 (36 possible POS tags according to this page, plus the CLASS outcome feature for the whole sentence).
This can be encoded for OpenNLP Maxent as follows:
PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1
or simply:
PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
(For a working code snippet, see my answer here: Training models using openNLP maxent.)
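If you want to generate such feature lines automatically, you can POS-tag each sentence first. Below is a minimal sketch using POSTaggerME from opennlp-tools; it assumes the opennlp-tools jar is on the classpath and that you have downloaded a pre-trained English POS model (the file name en-pos-maxent.bin is just an example):

import java.io.FileInputStream;
import java.io.IOException;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PosFeatureEncoder {

    public static void main(String[] args) throws IOException {

        // Load a pre-trained POS model (downloaded separately).
        POSModel posModel = new POSModel(new FileInputStream("en-pos-maxent.bin"));
        POSTaggerME tagger = new POSTaggerME(posModel);

        // An already tokenized sentence (tokenization omitted for brevity).
        String[] tokens = {"My", "name", "is", "XYZ"};
        String[] tags = tagger.tag(tokens);

        // Build one training line: the POS tags as features, plus the outcome.
        StringBuilder line = new StringBuilder();
        for (String tag : tags) {
            line.append(tag).append(' ');
        }
        line.append("CLASS=SomeClassOfYours1");
        System.out.println(line); // e.g. PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
    }
}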
Some more sample data would be:
- "By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
- "In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
- "As soon as she moved out, the mobile home was demolished, the suit said."
- ...
This would yield samples:
IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...
However, I don't expect such a classification to yield good results. It would be better to make use of other structural features of the sentence, such as the parse tree or dependency tree, which can be obtained using, e.g., the Stanford parser (see the sketch below).
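As an illustration, here is a rough sketch following the standard ParserDemo pattern shipped with the Stanford parser (the exact model path and API may vary between versions); it prints the constituency tree of the example sentence, whose productions could serve as classifier features:

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.Tree;

public class ParseTreeDemo {

    public static void main(String[] args) {

        // Load the English PCFG model bundled with the Stanford parser jar.
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

        // Tokenize the raw sentence with the PTB tokenizer.
        TokenizerFactory<CoreLabel> tokenizerFactory =
                PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        Tokenizer<CoreLabel> tokenizer =
                tokenizerFactory.getTokenizer(new StringReader("My name is XYZ."));
        List<CoreLabel> tokens = tokenizer.tokenize();

        // Parse and print the constituency tree.
        Tree parse = lp.apply(tokens);
        parse.pennPrint();
    }
}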
Edited on 28.3.2016:
You can also use the whole sentence as a training sample. However, be aware that:
- two sentences might contain the same words but have different meanings
- there is a pretty high chance of overfitting
- you should use short sentences
- you need a huge training set
Based on your example, I would encode the training samples as follows:
class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...
Notice that the outcome variable comes as the first element on each line.
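If you already have the tokens and their POS tags (e.g. from the opennlp-tools tagger sketched above), building such a line is simple string concatenation; a minimal sketch:

public class WordTagEncoder {

    public static void main(String[] args) {

        // Tokens and tags for "My name is XYZ"; in practice they would come
        // from a tokenizer and POS tagger rather than being hard-coded.
        String[] tokens = {"My", "name", "is", "XYZ"};
        String[] tags   = {"PRP$", "NN", "VBZ", "NNP"};

        // The outcome goes first, followed by one word_TAG feature per token.
        StringBuilder sample = new StringBuilder("class=SomeClassOfYours1");
        for (int i = 0; i < tokens.length; i++) {
            sample.append(' ').append(tokens[i]).append('_').append(tags[i]);
        }
        System.out.println(sample); // class=SomeClassOfYours1 My_PRP$ name_NN is_VBZ XYZ_NNP
    }
}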
Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.
package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {

    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer(new FileEventStream(trainingFileName));
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model in a file for later use (gzipped).
        File outFile = new File(modelFileName);
        AbstractModelWriter writer =
                new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from the file.
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Predicting the outcome for a new context using the loaded model.
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);
        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);

        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }
}
And some dummy training data (stored as training-file.txt):
class=Male My_PRP name_NN is_VBZ John_NNP
class=Male My_PRP name_NN is_VBZ Peter_NNP
class=Female My_PRP name_NN is_VBZ Anna_NNP
class=Female My_PRP name_NN is_VBZ Gaby_NNP
This yields the following output:
Indexing events using cutoff of 0
Computing event counts... done. 4 events
Indexing... done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 4
Number of Outcomes: 2
Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-2.772588722239781 0.5
2: ... loglikelihood=-2.4410105407571203 1.0
...
99: ... loglikelihood=-0.16111520541752372 1.0
100: ... loglikelihood=-0.15953272940719138 1.0
=======================================
class=Female
=======================================