My question is similar to this question. In spaCy, I can do part-of-speech tagging and noun phrase identification separately, e.g.:

import spacy

# 'en' is a model shortcut from older spaCy releases;
# newer versions expect a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
sentence = ('For instance , consider one simple phenomena : '
            'a question is typically followed by an answer , '
            'or some explicit statement of an inability or refusal to answer .')
token = nlp(sentence)
token_tag = [(word.text, word.pos_) for word in token]

Output looks like:

[('For', 'ADP'),
 ('instance', 'NOUN'),
 (',', 'PUNCT'),
 ('consider', 'VERB'),
 ('one', 'NUM'),
 ('simple', 'ADJ'),
 ('phenomena', 'NOUN'), 
 ...]

For noun phrases, I can use noun_chunks, which yields spans of words, as follows:

[nc for nc in token.noun_chunks] # [instance, one simple phenomena, an answer, ...]

I'm wondering if there is a way to group the POS tags by noun_chunks so that the output looks like this:

[('For', 'ADP'),
 ('instance', 'NOUN'), # or NOUN_CHUNKS
 (',', 'PUNCT'),
 ('one simple phenomena', 'NOUN_CHUNKS'), 
 ...]
titipata

1 Answer

I figured out how to do it. Basically, we can get the start and end positions of each noun phrase as follows:

noun_phrase_position = [(s.start, s.end) for s in token.noun_chunks]
noun_phrase_text = dict([(s.start, s.text) for s in token.noun_chunks])
token_pos = [(i, t.text, t.pos_) for i, t in enumerate(token)]

Then I combine this with this solution to merge the token_pos list based on the start and end positions:

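result = []
index = 0  # position of the first token not yet copied into result
for start, end in noun_phrase_position:
    result += token_pos[index:start]     # tokens before the noun phrase
    result.append(token_pos[start:end])  # the noun phrase, kept as a nested list
    index = end
result += token_pos[index:]  # tokens after the last noun phrase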

result_merge = []
for r in result:
    if isinstance(r, list) and r:
        # a nested list is a noun phrase: collapse it into a single entry
        start = r[0][0]
        result_merge.append((start, noun_phrase_text[start], 'NOUN_PHRASE'))
    else:
        result_merge.append(r)

Output

[(0, 'For', 'ADP'),
 (1, 'instance', 'NOUN_PHRASE'),
 (2, ',', 'PUNCT'),
 (3, 'consider', 'VERB'),
 (4, 'one simple phenomena', 'NOUN_PHRASE'),
 (7, ':', 'PUNCT'),
 (8, 'a', 'DET'),
 ...]
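
The two loops above can also be folded into a single pass. Here is a minimal sketch in plain Python, without spaCy, so it can be tried on its own. The helper name merge_spans and the hard-coded sample data are my own, not part of spaCy. It assumes the spans are sorted, non-overlapping, and end-exclusive, which is how the (s.start, s.end) pairs from noun_chunks are used above. It also rebuilds the phrase text by joining tokens with single spaces, which may differ from s.text when the original spacing is different.

```python
def merge_spans(tagged, spans, label="NOUN_PHRASE"):
    """Collapse each (start, end) span of `tagged` into one (start, text, label) triple.

    `tagged` is a list of (index, text, tag) triples, one per token.
    `spans` is a sorted list of non-overlapping (start, end) pairs, end exclusive.
    """
    merged = []
    index = 0  # first token not yet copied
    for start, end in spans:
        merged.extend(tagged[index:start])  # tokens before the span, unchanged
        text = " ".join(tok_text for _, tok_text, _ in tagged[start:end])
        merged.append((start, text, label))  # the span as a single entry
        index = end
    merged.extend(tagged[index:])  # tokens after the last span
    return merged


# Hypothetical sample data, copied from the tags shown in the question
tagged = [(0, 'For', 'ADP'), (1, 'instance', 'NOUN'), (2, ',', 'PUNCT'),
          (3, 'consider', 'VERB'), (4, 'one', 'NUM'), (5, 'simple', 'ADJ'),
          (6, 'phenomena', 'NOUN'), (7, ':', 'PUNCT')]
spans = [(1, 2), (4, 7)]
merged = merge_spans(tagged, spans)
```

With real spaCy output, tagged would be token_pos and spans would be noun_phrase_position from the code above.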