My question is similar to this question. In spaCy, I can do part-of-speech tagging and noun phrase identification separately, e.g.
import spacy
nlp = spacy.load('en')
sentence = ('For instance , consider one simple phenomena : '
            'a question is typically followed by an answer , '
            'or some explicit statement of an inability or refusal to answer .')
token = nlp(sentence)
token_tag = [(word.text, word.pos_) for word in token]
The output looks like:
[('For', 'ADP'),
('instance', 'NOUN'),
(',', 'PUNCT'),
('consider', 'VERB'),
('one', 'NUM'),
('simple', 'ADJ'),
('phenomena', 'NOUN'),
...]
For noun phrases (chunks), I can get noun_chunks,
which yields spans of words, as follows:
[nc for nc in token.noun_chunks] # [instance, one simple phenomena, an answer, ...]
I'm wondering if there is a way to merge the POS tags based on noun_chunks,
so that each chunk becomes a single entry, like this:
[('For', 'ADP'),
('instance', 'NOUN'), # or NOUN_CHUNKS
(',', 'PUNCT'),
('one simple phenomena', 'NOUN_CHUNKS'),
...]