I'm using the Stanford CoreNLP framework 3.4.1 to construct syntactic parse trees of Wikipedia sentences. From each parse tree I would then like to extract all tree fragments of a certain length (i.e. at most 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for each sub-tree.
This is what I am using to construct the parse tree; most of the code comes from TreePrint.printTreeInternal() for the conll2007 format, which I modified to suit my output needs:
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));
for (List<HasWord> sentence : dp) {
    StringBuilder plaintexSyntacticTree = new StringBuilder();
    String sentenceString = Sentence.listToString(sentence);
    PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
    List<Word> toks = tkzr.tokenize();
    // skip sentences shorter than 5 words
    if (toks.size() < 5)
        continue;
    log.info("\nTokens are: " + PTBTokenizer.labelList2Text(toks));
    // NOTE: loading the parser model is expensive; ideally this would be hoisted outside the loop
    LexicalizedParser lp = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "-maxLength", "80");
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    Tree parse = lp.apply(toks);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection<TypedDependency> tdl = gs.allTypedDependencies();
    Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
    it.indexLeaves();
    List<CoreLabel> tagged = it.taggedLabeledYield();
    // getSortedDeps
    List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
    for (TypedDependency dep : tdl) {
        NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
        sortedDeps.add(nd);
    }
    Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());
    for (int i = 0; i < sortedDeps.size(); i++) {
        Dependency<Label, Label, Object> d = sortedDeps.get(i);
        CoreMap dep = (CoreMap) d.dependent();
        CoreMap gov = (CoreMap) d.governor();
        Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
        Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);
        CoreLabel w = tagged.get(depi - 1);
        // Used for both the coarse and fine POS tag fields
        String tag = PTBTokenizer.ptbToken2Text(w.tag());
        String word = PTBTokenizer.ptbToken2Text(w.word());
        if (plaintexSyntacticTree.length() > 0)
            plaintexSyntacticTree.append(' ');
        plaintexSyntacticTree.append(word + '/' + tag + '/' + govi);
    }
    log.info("\nTree is: " + plaintexSyntacticTree);
}
In the output I need to get something of this format: word/Part-Of-Speech-tag/parentID, which is compatible with the output of the Google Syntactic N-Grams corpus.
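As an illustration of that target format (with made-up words, tags, and indices, not output from the code above), the assembly step reduces to joining parallel word/POS/parent-index sequences:

```java
public class NGramFormat {
    // Hypothetical helper: join parallel word/POS/parent-index arrays into
    // the word/POS/parentID format; parent index 0 conventionally marks the root.
    static String format(String[] words, String[] tags, int[] parents) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(words[i]).append('/').append(tags[i]).append('/').append(parents[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"the", "dog", "barked"};
        String[] tags = {"DT", "NN", "VBD"};
        int[] parents = {2, 3, 0};
        System.out.println(format(words, tags, parents));
        // the/DT/2 dog/NN/3 barked/VBD/0
    }
}
```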
What I can't seem to figure out is how to get the POS tag and parentID from the original syntactic parse tree (stored in the GrammaticalStructure as a dependency list, as far as I understand) for only a subset of the nodes of the original tree.
I have also seen some mentions of the HeadFinder, but as far as I understand that is only useful for constructing a GrammaticalStructure, whereas I am trying to use the existing one. I have also seen a somewhat similar issue about converting a GrammaticalStructure to a Tree, but that is still an open issue, and it does not tackle the problem of sub-trees or of creating a custom output. Instead of creating a tree from the GrammaticalStructure, I was thinking that I could just use the node references from the tree to get the information I need, but I am basically missing an equivalent of getNodeByIndex() which can get the index of a given node from the GrammaticalStructure.
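For what it's worth, the fragment-enumeration part of the problem can be sketched independently of CoreNLP. One standard approach is recursive expansion: treat every node as a fragment root, then grow each fragment by repeatedly adding a child of a node already in the fragment, stopping at the size limit. The Node class below is a hypothetical stand-in for a parse-tree node, not a CoreNLP type:

```java
import java.util.*;

public class TreeFragments {
    // Hypothetical minimal tree node (stand-in for a parse-tree node)
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        Node add(Node c) { children.add(c); return this; }
    }

    // Enumerate all connected fragments with at most maxNodes nodes: each
    // fragment is some node v plus a connected selection of its descendants.
    static Set<Set<Node>> fragments(Node root, int maxNodes) {
        Set<Set<Node>> result = new LinkedHashSet<>();
        Deque<Node> all = new ArrayDeque<>();
        all.push(root);
        while (!all.isEmpty()) {              // visit every node as a fragment root
            Node v = all.pop();
            grow(Collections.singleton(v), maxNodes, result);
            for (Node c : v.children)
                all.push(c);
        }
        return result;
    }

    // Extend a fragment by one child of an included node; sets deduplicate
    // fragments that are reachable along multiple expansion orders.
    static void grow(Set<Node> frag, int maxNodes, Set<Set<Node>> result) {
        if (!result.add(frag) || frag.size() == maxNodes)
            return;
        for (Node n : frag)
            for (Node c : n.children)
                if (!frag.contains(c)) {
                    Set<Node> bigger = new HashSet<>(frag);
                    bigger.add(c);
                    grow(bigger, maxNodes, result);
                }
    }

    public static void main(String[] args) {
        // (S (NP the dog) (VP barked)) as a toy tree
        Node s = new Node("S")
                .add(new Node("NP").add(new Node("the")).add(new Node("dog")))
                .add(new Node("VP").add(new Node("barked")));
        System.out.println(fragments(s, 5).size());
        // prints 23
    }
}
```

The open question above would then only be how to look up the POS tag and governor index for the nodes of each such fragment.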
UPDATE: I have managed to get all of the required information by using the SemanticGraph, as suggested in the answer. Here is a basic snippet of code that does that:
String documentText = value.toString();
Properties props = new Properties();
props.put("annotators", "tokenize,ssplit,pos,depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(documentText);
pipeline.annotate(annotation);
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
if (sentences != null && !sentences.isEmpty()) {
    CoreMap sentence = sentences.get(0);
    SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
    log.info("SemanticGraph: " + sg.toDotFormat());
    for (SemanticGraphEdge edge : sg.edgeIterable()) {
        int headIndex = edge.getGovernor().index();
        int depIndex = edge.getDependent().index();
        // emit the dependent's word and POS with its governor's index as the parentID
        log.info("[" + depIndex + "] " + edge.getDependent().word()
                + "/" + edge.getDependent().get(CoreAnnotations.PartOfSpeechAnnotation.class)
                + "/" + headIndex);
    }
}
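With the indices in hand, restricting the output to a sub-tree is just a filter over the index set. A CoreNLP-free sketch of that last step (the project helper and the re-rooting convention are my own assumptions, not part of any CoreNLP API): keep only the chosen 1-based token indices, and re-root any kept node whose governor falls outside the fragment.

```java
import java.util.*;

public class SubsetDeps {
    // Hypothetical sketch: given tokens (1-based indices), their POS tags, and
    // a parent-index array from a full dependency parse (0 = root), emit
    // word/POS/parentID entries for only a chosen subset of nodes. A parent
    // outside the subset is re-rooted to 0 so the fragment stays self-contained.
    static List<String> project(String[] words, String[] tags, int[] parents, Set<Integer> keep) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= words.length; i++) {
            if (!keep.contains(i))
                continue;
            int parent = parents[i - 1];
            if (!keep.contains(parent))
                parent = 0;   // governor not in fragment
            out.add(words[i - 1] + "/" + tags[i - 1] + "/" + parent);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"the", "dog", "barked", "loudly"};
        String[] tags = {"DT", "NN", "VBD", "RB"};
        int[] parents = {2, 3, 0, 3};
        System.out.println(project(words, tags, parents, new HashSet<>(Arrays.asList(1, 2))));
        // [the/DT/2, dog/NN/0]
    }
}
```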