
I'm using the Stanford CoreNLP framework 3.4.1 to construct syntactic parse trees of Wikipedia sentences. From each parse tree I would then like to extract all tree fragments up to a certain size (e.g., at most 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for each sub-tree.
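
For clarity, this is roughly the kind of enumeration I have in mind, using complete subtrees as a stand-in for fragments (a minimal sketch; Tree iterates over its subtrees in pre-order and size() returns a subtree's node count):

    // sketch: visit every subtree with at most 5 nodes
    for (Tree fragment : parse) {
        if (fragment.size() <= 5) {
            // emit this fragment as word/POS-tag/parentID triples
        }
    }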

This is what I am using to construct the parse tree; most of the code is from TreePrint.printTreeInternal() for the conll2007 format, which I modified to suit my output needs:

    // Load the parser and dependency machinery once, not once per sentence.
    LexicalizedParser lp = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "-maxLength", "80");
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));

    for (List<HasWord> sentence : dp) {
        StringBuilder plaintextSyntacticTree = new StringBuilder();
        String sentenceString = Sentence.listToString(sentence);

        PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
        List<Word> toks = tkzr.tokenize();
        // skip sentences shorter than 5 words
        if (toks.size() < 5)
            continue;
        log.info("\nTokens are: " + PTBTokenizer.labelList2Text(toks));

        Tree parse = lp.apply(toks);
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> tdl = gs.allTypedDependencies();
        Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
        it.indexLeaves();

        List<CoreLabel> tagged = it.taggedLabeledYield();
        // sort the dependencies by the index of the dependent
        List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
        for (TypedDependency dep : tdl) {
            NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
            sortedDeps.add(nd);
        }
        Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());

        for (int i = 0; i < sortedDeps.size(); i++) {
            Dependency<Label, Label, Object> d = sortedDeps.get(i);

            CoreMap dep = (CoreMap) d.dependent();
            CoreMap gov = (CoreMap) d.governor();

            Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
            Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);

            CoreLabel w = tagged.get(depi - 1);

            // used for both coarse and fine POS tag fields
            String tag = PTBTokenizer.ptbToken2Text(w.tag());
            String word = PTBTokenizer.ptbToken2Text(w.word());

            if (plaintextSyntacticTree.length() > 0)
                plaintextSyntacticTree.append(' ');
            plaintextSyntacticTree.append(word + '/' + tag + '/' + govi);
        }
        log.info("\nTree is: " + plaintextSyntacticTree);
    }

In the output I need to get something in this format: word/Part-Of-Speech-tag/parentID, which is compatible with the output of the Google Syntactic N-Grams.
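
For example, for the sentence "John eats apples" I would expect output along these lines (indices are 1-based token positions, with 0 marking the root; the exact tags are illustrative):

    John/NNP/2 eats/VBZ/0 apples/NNS/2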

I can't seem to figure out how I could get the POS tag and parentID from the original syntactic parse tree (stored in the GrammaticalStructure as a dependency list, as far as I understand) for only a subset of nodes from the original tree.

I have also seen some mentions of the HeadFinder, but as far as I understand that is only useful for constructing a GrammaticalStructure, whereas I am trying to use the existing one. I have also seen a somewhat similar issue about converting a GrammaticalStructure to a Tree, but that is still an open issue and it does not tackle the problem of sub-trees or of creating a custom output. Instead of creating a tree from the GrammaticalStructure, I was thinking that I could just use the node reference from the tree to get the information I need, but I am basically missing an equivalent of getNodeByIndex() that goes the other way, i.e. gets the index by node from the GrammaticalStructure.
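
For a single node the index can at least be read off the node's own label, since the copy it above was built with CoreLabel.factory() and indexed with indexLeaves(). A minimal sketch, where leaf stands for any leaf taken from that copy:

    CoreLabel label = (CoreLabel) leaf.label();
    int tokenIndex = label.index(); // 1-based, from the IndexAnnotation set by indexLeaves()

What I am missing is the same lookup on the dependency side.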

UPDATE: I have managed to get all of the required information by using the SemanticGraph, as suggested in the answer. Here is a basic snippet of code that does that:

    String documentText = value.toString();
    Properties props = new Properties();
    props.put("annotators", "tokenize,ssplit,pos,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(documentText);
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);

    if (sentences != null && !sentences.isEmpty()) {
        CoreMap sentence = sentences.get(0);
        SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
        log.info("SemanticGraph: " + sg.toDotFormat());
        for (SemanticGraphEdge edge : sg.edgeIterable()) {
            IndexedWord gov = edge.getGovernor();
            IndexedWord dep = edge.getDependent();
            // one word/POS-tag/parentID triple per dependent
            log.info(dep.word() + "/" + dep.get(CoreAnnotations.PartOfSpeechAnnotation.class) + "/" + gov.index());
        }
    }

1 Answer


The Google syntactic n-grams use dependency trees rather than constituency trees. So, indeed, the only way to get that representation is by converting the tree to a dependency tree. The parent id you would get from the constituency parse points to an intermediate node, rather than to another word in the sentence.
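
To make the contrast concrete, for "John eats apples" the two representations look roughly like this (written out by hand):

    Constituency: (S (NP (NNP John)) (VP (VBZ eats) (NP (NNS apples))))
    Dependency:   nsubj(eats-2, John-1)  root(ROOT-0, eats-2)  dobj(eats-2, apples-3)

In the constituency tree the parent of "John" is the NP node, not another word, so it cannot serve as a parentID in the n-gram format.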

My recommendation would be to run the dependency parser annotator (annotators = tokenize,ssplit,pos,depparse), and then extract all clusters of 5 neighboring nodes from the resulting SemanticGraph, as sketched below.
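
Here is one simple take on that extraction (a sketch, not the only way to define a cluster: it grows a single breadth-first cluster of up to k nodes around each word, ignoring edge direction, rather than enumerating every connected k-node subgraph; the helper name neighborClusters is made up, while the SemanticGraph calls are real API):

    import java.util.*;
    import edu.stanford.nlp.ling.IndexedWord;
    import edu.stanford.nlp.semgraph.SemanticGraph;

    // Collect, for each word, a connected cluster of up to k nodes by
    // breadth-first search over the dependency graph, ignoring edge direction.
    static List<List<IndexedWord>> neighborClusters(SemanticGraph sg, int k) {
        List<List<IndexedWord>> clusters = new ArrayList<>();
        for (IndexedWord start : sg.vertexSet()) {
            List<IndexedWord> cluster = new ArrayList<>();
            Set<IndexedWord> seen = new HashSet<>();
            Queue<IndexedWord> frontier = new ArrayDeque<>();
            frontier.add(start);
            seen.add(start);
            while (!frontier.isEmpty() && cluster.size() < k) {
                IndexedWord node = frontier.poll();
                cluster.add(node);
                // neighbors = children plus parents in the dependency graph
                List<IndexedWord> adjacent = new ArrayList<>(sg.getChildList(node));
                adjacent.addAll(sg.getParentList(node));
                for (IndexedWord next : adjacent) {
                    if (seen.add(next)) {
                        frontier.add(next);
                    }
                }
            }
            clusters.add(cluster);
        }
        return clusters;
    }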

Gabor Angeli