I am trying to extract phrases from my corpus. For this I have defined two rules: one is a noun followed by one or more nouns, and the other is an adjective followed by a noun. If the same phrase is extracted by both rules, the program should ignore the second match. The problem I am facing is that phrases are extracted by the first rule only, and the second rule is not being applied. Below is the code:

    PATTERN = r"""
      NP: {<NN><NN>+}
      {<ADJ><NN>*}

       """
    MIN_FREQ = 1
    MIN_CVAL = -13 # lowest cval -13
    def __init__(self):
        corpus_root = os.path.abspath('../multiwords/test')
        self.corpus = nltk.corpus.reader.TaggedCorpusReader(corpus_root,'.*')
        self.word_count_by_document = None
        self.phrase_frequencies = None

    def calculate_phrase_frequencies(self):
        """
       extract the sentence chunks according to PATTERN and calculate
       the frequency of chunks with pos tags
       """

        # pdb.set_trace()
        chunk_freq_dict = defaultdict(int)
        chunker = nltk.RegexpParser(self.PATTERN)

        for sent in self.corpus.tagged_sents():

            sent = [s for s in sent if s[1] is not None]

            for chk in chunker.parse(sent).subtrees():

                if str(chk).startswith('(NP'):                  

                    phrase = chk.__unicode__()[4:-1]

                    if '\n' in phrase:
                        phrase = ' '.join(phrase.split())

                    just_phrase = ' '.join([w.rsplit('/', 1)[0] for w in phrase.split(' ')])
                   # print(just_phrase)
                    chunk_freq_dict[just_phrase] += 1
        self.phrase_frequencies = chunk_freq_dict
        #print(self.phrase_frequencies)
Cœur
user3778289
  • Not familiar with the topic, but since it's somehow related to regex, you could try something like `{<NN><NN>+|<ADJ><NN>*}`... It would be helpful if you could elaborate on how those patterns are supposed to work (example / documentation). – Snow bunting Jan 10 '18 at 12:16
  • @Snowbunting It will extract the noun phrases. The first rule is a noun followed by one or more nouns. Say "Barack Obama supports expanding social security.": it will extract "Barack Obama" rather than "Barack", "Obama", and then "social security". I want to extract key phrases. – user3778289 Jan 10 '18 at 13:06

1 Answer

First of all, Python is indentation-sensitive, and any whitespace inside a multi-line string is part of the string. Keep the grammar free of stray leading spaces and align the patterns (the braces) visually, so that each rule is easy to check.

Moreover, I think you want <ADJ><NN>+ as your second pattern. + means one or more, whereas * means zero or more, so <ADJ><NN>* also matches an adjective with no noun after it.
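
To see the difference between the two quantifiers in isolation, here is a small sketch using only Python's re module on a hand-written tag string. It is an analogy, not what NLTK runs internally: the tag string and the two compiled patterns are assumptions made for illustration, and the sketch ignores the fact that the first rule would already have claimed some of the nouns.

```python
import re

# Tag sequence of "the little yellow shepherd dog barked at the silly cat",
# written as one string (an illustrative stand-in, not NLTK's own format).
tags = "<DT><ADJ><ADJ><NN><NN><VBD><IN><DT><ADJ><NN>"

# * allows zero nouns after the adjective, + requires at least one.
star = re.compile(r"<ADJ>(?:<NN>)*")
plus = re.compile(r"<ADJ>(?:<NN>)+")

star_matches = star.findall(tags)
plus_matches = plus.findall(tags)

print(star_matches)  # ['<ADJ>', '<ADJ><NN><NN>', '<ADJ><NN>']
print(plus_matches)  # ['<ADJ><NN><NN>', '<ADJ><NN>']
```

With * the first adjective matches on its own, which is how bare adjectives can end up as "phrases".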

I hope this solves the issue.

#!/usr/bin/env python
import nltk

PATTERN = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""

sentence = [('the', 'DT'), ('little', 'ADJ'), ('yellow', 'ADJ'),
            ('shepherd', 'NN'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
            ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

cp = nltk.RegexpParser(PATTERN)
print(cp.parse(sentence))

Result:

(S
  the/DT
  little/ADJ
  yellow/ADJ
  (NP shepherd/NN dog/NN)
  barked/VBD
  at/IN
  the/DT
  (NP silly/ADJ cat/NN))
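
If the goal is a frequency table of the extracted phrases, one possible way to get it is to walk the NP subtrees and join their leaves instead of slicing the printed tree. This is only a sketch: count_phrases and GRAMMAR are hypothetical names introduced here, and the input is assumed to be a list of sentences, each a list of (word, tag) pairs. As far as I understand, rules within one stage are applied in order and a later rule only sees tokens that are still unchunked, so the same span should not be counted twice.

```python
from collections import Counter

import nltk

# Same grammar as above; GRAMMAR is a name chosen for this sketch.
GRAMMAR = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""


def count_phrases(tagged_sents, grammar=GRAMMAR):
    """Count NP chunks (words only, tags dropped) over tagged sentences."""
    chunker = nltk.RegexpParser(grammar)
    counts = Counter()
    for sent in tagged_sents:
        tree = chunker.parse(sent)
        # Visit only the subtrees labelled NP, skipping the root S node.
        for chunk in tree.subtrees(filter=lambda t: t.label() == "NP"):
            phrase = " ".join(word for word, _tag in chunk.leaves())
            counts[phrase] += 1
    return counts


sentence = [('the', 'DT'), ('little', 'ADJ'), ('yellow', 'ADJ'),
            ('shepherd', 'NN'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
            ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

phrase_counts = count_phrases([sentence])
print(phrase_counts)
```

For the example sentence this should count "shepherd dog" and "silly cat" once each.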

Reference: http://www.nltk.org/book/ch07.html

Snow bunting