NLTK Convert Tree to Array?

Question

To start with, I turned the tree into a string. You pass in an already tokenized sentence and it builds a tree:

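import nltk

def LanguageCreateTree(tokenizedSentence):
    # GRAMMAR is the chunk grammar, defined elsewhere (not shown here)
    cp = nltk.RegexpParser(GRAMMAR)
    result = cp.parse(tokenizedSentence)
    result = str(result)  # the tree is converted to a string here
    print(result)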

>>> A red cat with a hat
(S A/DT (VP red/VBN (NP cat/NN)) with/IN a/DT hat/JJ)

How would I go about making a list of lists based on this string? I need it to produce a list like this:

[['A','DT'], ['VP', ['red','VBN'], ['NP', ['cat','NN']]], ['with','IN'], ['a','DT'], ['hat','JJ']]

Is it possible to start from the tree instead of going to a string and then back to basically a tree? — KobeJohn, Sep 13 '15 at 12:00
Also isn't your parsed list missing the `'S'` designator at the beginning? — KobeJohn, Sep 13 '15 at 13:25
The output you describe is still a tree; a list containing other lists etc recursively. — tripleee, Sep 13 '15 at 14:54

alexis · Accepted Answer · 2015-09-13T19:14:41.980


This is a lot easier than you think :-) The NLTK's Tree class is a list (more specifically, it is derived from the list class). And it has exactly the structure you are after. Just use ordinary list methods on the result of cp.parse(). Here's an approximate example (building a tree on the fly for illustration):

>>> from nltk import Tree
>>> t = Tree.fromstring("(S A/DT (VP red/VBN (NP cat/NN)) with/IN a/DT hat/JJ)")

>>> print(t[1])
(VP red/VBN (NP cat/NN))
>>> print(t[1][0])   # Element 0 of the subtree at index 1
red/VBN

In this example I did not split the word from the POS tag; your tree will look different. Also note that Tree has nice ways of printing itself, but you can see the real structure by using repr():

>>> print(repr(t[1]))
Tree('VP', ['red/VBN', Tree('NP', ['cat/NN'])])
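If plain nested lists are really needed (for example to serialize the result), a short recursive copy is enough. The following is only a sketch: `tree_to_list` is a hypothetical helper name, not part of NLTK, and the tree is built by hand as a stand-in for the result of cp.parse() on a POS-tagged sentence, where each leaf is a (word, tag) tuple.

```python
from nltk import Tree


def tree_to_list(node):
    """Recursively copy a Tree into plain nested lists (hypothetical helper)."""
    if isinstance(node, Tree):
        # A subtree becomes [label, child, child, ...].
        return [node.label()] + [tree_to_list(child) for child in node]
    if isinstance(node, str):
        # An untagged leaf (a bare string) is kept as it is.
        return node
    # A tagged leaf is a (word, tag) tuple; turn it into a two-item list.
    return list(node)


# Hand-built stand-in for the chunker output (an assumption, since the
# asker's GRAMMAR is not shown).
t = Tree('S', [
    ('A', 'DT'),
    Tree('VP', [('red', 'VBN'), Tree('NP', [('cat', 'NN')])]),
    ('with', 'IN'),
    ('a', 'DT'),
    ('hat', 'JJ'),
])

nested = tree_to_list(t)
print(nested)
```

The result starts with the root label 'S'; slicing it off with nested[1:] gives the list of children on its own.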

edited Sep 13 '15 at 19:14

answered Sep 13 '15 at 14:19

alexis


Thank you, that's a good solution. But how would I go about separating the "A/DT"s then? So that they become two different list items too. – deepadmax Sep 13 '15 at 14:53

In your own tree, there shouldn't be anything to separate: words and POS tags are separate, this is just how they're printed. Try it, and provide a short but complete (executable!) example if you can't get it to work. – alexis Sep 13 '15 at 15:47
If I just skip what you told me to do and just use result[x] It works with "A/DT" and it shows it as a list like so: ('A','DT'). – deepadmax Sep 13 '15 at 17:50
No, **now** you're doing what I told you to do: "Just use ordinary list methods on the result of `cp.parse()`". You hadn't understood me before; my example is just for illustration since your code is incomplete. (In the code you give in your question, `result` is a string, so what you say is impossible; but I'm sure your real code is different.) – alexis Sep 13 '15 at 19:07
But glad you got it to work. If my answer has resolved your problem, please "accept" it by clicking on the big check mark next to it. – alexis Sep 13 '15 at 19:08
Yeah, sorry. This is my first time posting here so I'll try to be more clear next time. Thank you for your help! :) – deepadmax Sep 13 '15 at 19:40
No prob! But incomplete questions are harder to answer, as you can see. Welcome to stackoverflow, by the way :-) – alexis Sep 13 '15 at 20:28
I have the same problem. I want to convert my cp.parse output to an array structure. Please take a look at my question. This is not the answer, by the way: http://stackoverflow.com/questions/36702150/python-re-write-the-text-with-its-proper-nouns-chunked – Arda Nalbant Apr 22 '16 at 10:32
@Arda, if it's the same problem, use the same solution! But actually the answer to your question is "use a Named Entity Recognizer". I don't understand why you keep saying it isn't. If you want to do this for Turkish, train a NER. Anyway, your question was marked a duplicate (correctly, in my opinion). If you're sure it isn't, you must edit it and explain why not, and ask for it to be reopened. – alexis Apr 22 '16 at 10:55
Also see this answer (also by me) to a related question: http://stackoverflow.com/a/30665014/699305 – alexis Apr 22 '16 at 10:56
Dear Mr @alexis, trust me, I don't want to waste anybody's time. I really searched and tried the answers. Let me explain why I shouldn't use a NER tagger: my graduation project is creating my own NER for Turkish with my own search methods (I created sets and functions which give me the persons, orgs and locs). All I want is to make the cp.parse output a two-dimensional array. I searched the book and couldn't remove the tags; I really feel sorry about that. I'm struggling to figure out how I can print only [[Rami Eid],[Stony Brook University],[NY]] without any POS or NER tagger. – Arda Nalbant Apr 22 '16 at 11:14
Basically, I want to use nltk only to count proper nouns as one array member. Sorry for the very long comment; I didn't want to create the same question many times. Feel free to tell me anything I did wrong and tell me what to do. Also, thank you again for the reply. – Arda Nalbant Apr 22 '16 at 11:14
I'm afraid "I couldn't get it to work" is not a question anyone can answer. And look: grouping capitalized words is not the way to write a NER; if I were your teacher I'd fail you if that was your project. Train an NER for Turkish. – alexis Apr 22 '16 at 11:25
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109896/discussion-between-alexis-and-arda-nalbant). – alexis Apr 22 '16 at 11:25
