So far I have used the stanfordnlp library in Python to tokenize and POS-tag a dataframe of text, and I would now like to extract noun phrases. I have tried two different things, and I am having problems with both:
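For context, my pipeline looks roughly like this (Danish models downloaded once via `stanfordnlp.download('da')`; I then apply it row by row to the dataframe):

```python
import stanfordnlp

# stanfordnlp.download('da')  # one-time download of the Danish models
# (I don't think Danish needs the 'mwt' processor, but I may be wrong)
nlp = stanfordnlp.Pipeline(lang='da', processors='tokenize,pos,lemma')

doc = nlp("Hunden jagtede katten gennem den gamle park.")
for sentence in doc.sentences:
    for word in sentence.words:
        # each word carries the token text, a universal POS tag, and a lemma
        print(word.text, word.upos, word.lemma)
```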
From what I can see, the stanfordnlp Python library doesn't offer NP chunking out of the box; at least I haven't been able to find a way to do it. As a workaround I made a new dataframe of all words in order with their POS tags and then merged consecutive nouns (roughly as in the sketch below). However, this is very crude and quickly gets complicated for me.
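To make the crudeness concrete, this is roughly the logic I ended up with: walk the words in order and merge runs of consecutive NOUN/PROPN tags into one candidate phrase:

```python
def naive_noun_spans(tagged_words):
    """Merge runs of consecutive nouns into candidate phrases.

    tagged_words: list of (text, upos) tuples in sentence order,
    e.g. taken from a stanfordnlp sentence as (w.text, w.upos).
    """
    spans, current = [], []
    for text, upos in tagged_words:
        if upos in ("NOUN", "PROPN"):
            current.append(text)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

This obviously misses determiners and adjectives, which is a big part of why I'd like a proper chunker.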
I have been able to do this for English text using NLTK (roughly as in the snippet below), so I have also tried to use the Stanford CoreNLP API in NLTK. My problem there is that I need a Danish model when setting up CoreNLP with Maven (which I am very inexperienced with). For approach 1 above I have been using the Danish model found here, but that doesn't seem to be the kind of model I am asked to find; again, I don't exactly know what I am doing, so apologies if I am misunderstanding something here.
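For comparison, the English version that worked for me with plain NLTK looked roughly like this (a simple regexp grammar over Penn Treebank tags, so only a sketch):

```python
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # determiner? adjectives* nouns+
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```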
My questions then are (1) whether it is in fact possible to do chunking of NPs in stanfordnlp in Python, (2) whether I can somehow pass the tokenized, POS-tagged, and lemmatized words from stanfordnlp to NLTK and do the chunking there (a rough sketch of what I have in mind is below), or (3) whether it is possible to set up CoreNLP in Danish and then use the CoreNLP API with NLTK.
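For (2), what I have in mind is something like the following: write the chunk grammar over universal POS tags and feed NLTK's RegexpParser the (text, upos) pairs straight from stanfordnlp. The NP pattern here is just my guess at what might work for Danish, so please treat it as a sketch rather than something I know to be right:

```python
import nltk

# Guessed NP pattern over universal POS tags: determiner? adjectives* nouns+
upos_grammar = "NP: {<DET>?<ADJ>*<NOUN|PROPN>+}"
upos_chunker = nltk.RegexpParser(upos_grammar)

def chunk_sentence(sentence):
    """sentence: a stanfordnlp Sentence whose words carry .text and .upos."""
    tagged = [(w.text, w.upos) for w in sentence.words]
    tree = upos_chunker.parse(tagged)
    return [" ".join(tok for tok, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
```

If this is a reasonable way to go, I'd be happy with it; I just don't know whether mixing the two libraries like this is considered sound.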
Thank you, and apologies for my lack of clarity here.