3

I would like to convert the following nltk Tree representation into JSON format:

nltk Tree structure

Desired output:

{
    "scores": {
        "filler": [
            [
                "scores"
            ],
            [
                "for"
            ]
        ],
        "extent": [
            "highest"
        ],
        "team": [
            "India"
        ]
    }
}
Raj
  • It is not valid JSON: there are two "team" names in the same object. A JSON object is an unordered set of name/value pairs. Different JSON parsers may produce different results: a parser may preserve only the first 'team' pair, or only the last one, or (unlikely) [create a list `["India", "Pakistan"]`](http://stackoverflow.com/a/7828652/4279) – jfs Apr 16 '14 at 16:38
  • See also [rfc 7159](http://tools.ietf.org/html/rfc7159): *"When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates."* – jfs Apr 16 '14 at 16:44
  • Again, the source tree contains duplicate names `('filler', 'filler')`. Why do you remove them from the output? – jfs Apr 18 '14 at 15:35
  • They got removed automatically while building the dict. It's okay to have them removed, since the filler info is not required in the output. – Raj Apr 18 '14 at 15:43
  • How do you know that it is not required in the output? – jfs Apr 18 '14 at 15:45
  • I mean, it is not required for my use-case. I think your answer is the right one! – Raj Apr 18 '14 at 15:46
  • Imagine you are given an arbitrary `nltk.Tree`: how does your code know which nodes exactly it should remove from the tree? What is special about the child `'filler'` nodes that they are removed but the parent `'filler'` node is preserved? – jfs Apr 18 '14 at 15:48
  • My input Tree itself has duplicate keys, so converting it to a dict will be erroneous and unexpected. I need to fix my input Tree to have something like filler1, filler2. Do you know how we can handle a recursive feature-based grammar? I have `filler -> filler filler` and `filler -> 'is' | 'are' | 'the' | 'for' | 'by' | 'of'` to accommodate any number of sequential unwanted words in my input – Raj Apr 18 '14 at 15:57
  • ok. As I understand you should not remove `'filler'` from json output if it is present in the input Tree. You should transform one `nltk.Tree` into another `nltk.Tree` (with collapsed 'filler' nodes) instead. It is unrelated to serializing `nltk.Tree` to json. – jfs Apr 18 '14 at 17:13
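
The duplicate-name behavior discussed in the comments above can be checked with Python's standard `json` module. This is a minimal sketch: the sample text is made up for illustration, and other parsers may behave differently, as the RFC quote notes.

```python
import json

# Made-up sample with a duplicate "team" name
text = '{"team": "India", "team": "Pakistan"}'

# By default, a later duplicate name overwrites an earlier one.
last_wins = json.loads(text)
print(last_wins)  # {'team': 'Pakistan'}

# object_pairs_hook receives every name/value pair in document order,
# so duplicates can be detected or kept instead of silently dropped.
pairs = json.loads(text, object_pairs_hook=list)
print(pairs)  # [('team', 'India'), ('team', 'Pakistan')]
```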

4 Answers

3

It looks like the input tree may contain children with the same name. To support the general case, you could convert each Tree into a dictionary that maps its name to its children list:

from nltk import Tree  # $ pip install nltk

def tree2dict(tree):
    # NLTK 3 replaced the `tree.node` attribute with the `tree.label()` method.
    return {tree.label(): [tree2dict(t) if isinstance(t, Tree) else t
                           for t in tree]}

Example:

import json
import sys

tree = Tree('scores',
            [Tree('extent', ['highest']),
             Tree('filler',
                  [Tree('filler', ['scores']),
                   Tree('filler', ['for'])]),
             Tree('team', ['India'])])
d = tree2dict(tree)
json.dump(d, sys.stdout, indent=2)

Output:

{
  "scores": [
    {
      "extent": [
        "highest"
      ]
    }, 
    {
      "filler": [
        {
          "filler": [
            "scores"
          ]
        }, 
        {
          "filler": [
            "for"
          ]
        }
      ]
    }, 
    {
      "team": [
        "India"
      ]
    }
  ]
}
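
If a label-keyed dictionary is still preferred, the nested structure shown above can be regrouped so that repeated labels collect into lists instead of overwriting each other. The following is only a sketch: `group_children` and the `"_leaves"` key are hypothetical names, the input is the dictionary from the output above written out by hand, and the result is not exactly the shape requested in the question.

```python
import json

def group_children(children):
    """Group a list of leaves and single-key dicts by label."""
    # A node whose children are all leaves stays a plain list.
    if not any(isinstance(c, dict) for c in children):
        return list(children)
    grouped = {}
    for child in children:
        if isinstance(child, dict):
            (label, sub), = child.items()  # each dict holds exactly one label
            grouped.setdefault(label, []).append(group_children(sub))
        else:
            # "_leaves" is a made-up key for leaves that sit next to subtrees
            grouped.setdefault("_leaves", []).append(child)
    return grouped

# The dictionary shown in the output above, written out by hand
d = {"scores": [
    {"extent": ["highest"]},
    {"filler": [{"filler": ["scores"]}, {"filler": ["for"]}]},
    {"team": ["India"]},
]}

grouped = {label: group_children(kids) for label, kids in d.items()}
print(json.dumps(grouped, indent=2))
```

Every occurrence of a label contributes one entry to that label's list, so the two inner `filler` nodes both survive.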
jfs
2

Convert Tree to dict and then to JSON.

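import json
import nltk

def tree_to_dict(tree):
    # Note: children that share a label overwrite each other in the dict.
    tdict = {}
    for t in tree:
        if isinstance(t, nltk.Tree) and isinstance(t[0], nltk.Tree):
            # Subtree containing further subtrees: recurse
            tdict[t.label()] = tree_to_dict(t)
        elif isinstance(t, nltk.Tree):
            # Subtree holding leaves: keep only its first leaf
            tdict[t.label()] = t[0]
    return tdict

def dict_to_json(d):
    return json.dumps(d)

# `tree` is the nltk.Tree from the question
output_json = dict_to_json({tree.label(): tree_to_dict(tree)})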
Raj
  • Convert the `tree` into a `dict` and use `json.dump(result_dict, sys.stdout, indent=2)` instead of generating JSON text by hand. – jfs Apr 16 '14 at 16:38
  • Thanks. Will look into it again. – Raj Apr 16 '14 at 16:41
  • @J.F.Sebastian How to convert tree into dict? which method should I use? – Raj Apr 18 '14 at 07:51
  • `t.node` has to be switched to `t.label()` now. for the sentence "Tom Brady plays for the Patriots." the output was: `{'ORGANIZATION': ('Patriots', 'NNP'), 'PERSON': ('Brady', 'NNP')}` – Alex Moore-Niemi Feb 28 '18 at 19:00
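
For code that has to run on both sides of the `node` to `label()` change mentioned in the comment above, a small accessor can hide the difference. This is a sketch under the assumption that older trees expose only a `node` attribute and newer ones a `label()` method; `get_label` and the two stand-in classes are hypothetical and exist only so the example runs without NLTK installed.

```python
def get_label(tree):
    """Return the node label for either style of tree object."""
    label = getattr(tree, "label", None)
    if callable(label):
        return label()  # newer style: label() method
    return tree.node    # older style: node attribute

class OldStyleTree:
    """Stand-in mimicking a tree that only has a `node` attribute."""
    node = "NP"

class NewStyleTree:
    """Stand-in mimicking a tree that only has a `label()` method."""
    def label(self):
        return "NP"

print(get_label(OldStyleTree()), get_label(NewStyleTree()))  # NP NP
```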
2

This will convert the tree to a dictionary with the tree labels as keys; you can then convert it to JSON with `json.dumps`.

    import nltk

    def tree_to_dict(tree):
        tree_dict = dict()
        leaves = []
        for subtree in tree:
            if isinstance(subtree, nltk.tree.Tree):
                # Merge the labels found in the subtree into this dict
                tree_dict.update(tree_to_dict(subtree))
            else:
                # Leaves are expected to be (word, tag) tuples
                (expression, tag) = subtree
                leaves.append(expression)
        tree_dict[tree.label()] = " ".join(leaves)

        return tree_dict
  • as a point of comparison, this outputs `{'ORGANIZATION': 'Patriots', 'PERSON': 'Brady', 'S': 'plays for the .'}` for the sentence "Tom Brady Plays for the Patriots." – Alex Moore-Niemi Feb 28 '18 at 19:01
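
The code in this answer unpacks every leaf as a `(word, tag)` tuple, so a tree with plain string leaves, such as the one in the question, would not unpack as intended. A small helper can accept both kinds of leaf. This is a sketch: `leaf_text` is a hypothetical name and the sample leaves are made up.

```python
def leaf_text(leaf):
    """Return the word for a leaf that is either 'word' or ('word', 'TAG')."""
    if isinstance(leaf, tuple):
        return leaf[0]
    return leaf

# Made-up mix of tagged and untagged leaves
leaves = [("plays", "VBZ"), "for", ("the", "DT")]
joined = " ".join(leaf_text(leaf) for leaf in leaves)
print(joined)  # plays for the
```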
0

A related alternative. For my purposes, I didn't need an exact tree preserved, but instead wanted to extract entities as keys and tokens as lists of values. For the sentence "Tom and Larry play for the Patriots." I wanted the following JSON:

{
  "PERSON": [
    "Tom",
    "Larry"
  ],
  "ORGANIZATION": [
    "Patriots"
  ]
}

This preserves the order of tokens (per entity type) without "stomping" on values already set for an entity key. You can reuse the `json.dump` code from the other answers to serialize this dict to JSON.

import nltk  # needed for the nltk.Tree check below
from nltk import tag, chunk, tokenize

def prep(sentence):
    return chunk.ne_chunk(tag.pos_tag(tokenize.word_tokenize(sentence)))

t = prep("Tom and Larry play for the Patriots.")

def tree_to_dict(tree):
    tree_dict = dict()
    for st in tree:
        # not everything gets a NE tag,
        # so we can ignore untagged tokens
        # which are stored in tuples
        if isinstance(st, nltk.Tree):
            if st.label() in tree_dict:
                tree_dict[st.label()] = tree_dict[st.label()] + [st[0][0]]
            else:
                tree_dict[st.label()] = [st[0][0]]
    return tree_dict

print(tree_to_dict(t))
# {'PERSON': ['Tom', 'Larry'], 'ORGANIZATION': ['Patriots']}
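
One limitation of `st[0][0]` above is that it keeps only the first token of each chunk, so a multi-word entity would be truncated. The sketch below joins all tokens of a chunk instead. It uses a hypothetical `Chunk` stand-in so that it runs without NLTK or its models; `nltk.Tree` offers the same `label()` and `leaves()` methods. The chunked sentence is built by hand and is not actual `ne_chunk` output.

```python
from collections import defaultdict

class Chunk:
    """Stand-in for an nltk.Tree chunk (hypothetical, for illustration)."""
    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def leaves(self):
        return list(self._leaves)

def entities_by_label(chunked):
    """Map each entity label to the list of full entity strings."""
    entities = defaultdict(list)
    for item in chunked:
        # Untagged tokens are plain (word, tag) tuples without label().
        if hasattr(item, "label"):
            words = [word for word, _tag in item.leaves()]
            entities[item.label()].append(" ".join(words))
    return dict(entities)

# Hand-built example, not real chunker output
sentence = [
    Chunk("PERSON", [("Tom", "NNP")]),
    ("and", "CC"),
    Chunk("PERSON", [("Larry", "NNP")]),
    ("play", "VBP"), ("for", "IN"), ("the", "DT"),
    Chunk("ORGANIZATION", [("New", "NNP"), ("England", "NNP"), ("Patriots", "NNPS")]),
    (".", "."),
]
print(entities_by_label(sentence))
# {'PERSON': ['Tom', 'Larry'], 'ORGANIZATION': ['New England Patriots']}
```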
Alex Moore-Niemi