2

I need to express a NOT (negation) condition in a grammar for NLTK's regex parser. I would like to chunk word sequences with the structure 'Coffee & Tea', but not when a word tagged <IN> comes right before the sequence. For example, 'in London and Paris' should not be chunked by the parser.

My code is as follows:

grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''

I tried the above grammar to solve the problem, but it is not working. Could someone please tell me what I am doing wrong?

Example:

import nltk

def parse_sentence(sentence):
    # Tokenize and POS-tag the sentence, then chunk it with the grammar.
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

Result for sentence 1 is:
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)

Result for sentence2 is:
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (NP London/NNP and/CC Paris/NNP)
  ?/.)

As can be seen, in both sentence1 and sentence2 the phrases "Coffee & TV" and "London and Paris" get chunked as a group, although I do not wish to chunk "London and Paris". One way of avoiding that is to ignore patterns that are preceded by an <IN> POS tag.

In a nutshell, I need to know how to add NOT (negation) conditions for POS tags in a regex parser's grammar. The standard syntax of '^' followed by the tag definition does not seem to work.

Ram G Athreya
  • can you give more context in how you are using this? It would be easier if you provided a [MCVE] – Nathan McCoy Mar 11 '17 at 04:36
  • I am also adding an example to the question. I just need to know how to add NOT(negation) conditions for POS tags in a regex parser. Standard syntax of using '^' followed by the tag definition does not seem to work. – Ram G Athreya Mar 11 '17 at 04:48
  • In a regex, `^` normally means the start of the line. It only means "not" inside a character class (square brackets). – alexis Mar 11 '17 at 20:59
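The two meanings of `^` mentioned in the comment above can be checked with Python's standard `re` module. This is only a small sketch on plain strings (the sample text and variable names are made up for illustration); it does not involve NLTK's tag patterns.

```python
import re

text = "in London and Paris"

# Outside a character class, ^ anchors the match to the start of the string.
anchored_hit = re.findall(r"^in", text)       # "in" is at the start
anchored_miss = re.findall(r"^London", text)  # "London" is not at the start

# Inside square brackets, ^ negates the class:
# here it matches any character that is neither a vowel nor a space.
consonants = re.findall(r"[^aeiou ]", "in Paris")

print(anchored_hit, anchored_miss, consonants)
```

So `(^<IN>)` in the question's grammar does not read as "not <IN>".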

3 Answers

3

What you need is a "negative lookbehind" expression. Unfortunately, it doesn't work in the chunk parser, so I suspect that what you want cannot be specified as a chunking regexp.

Here is an ordinary negative lookbehind: Match "Paris", but not if preceded by "and ".

>>> import re
>>> re.findall(r"(?<!and) Paris", "Search in London and Paris etc.")
[]

Unfortunately, the corresponding lookbehind chunking rule does not work. NLTK's regexp engine tweaks the regexp you pass it in order to interpret the POS types, and it gets confused by lookbehinds. (I'm guessing the < character in the lookbehind syntax is misinterpreted as a tag delimiter.)

>>> parser = nltk.RegexpParser(r"NP: {(?<!<IN>)<NNP>+<CC><NN.*>+}")
...
ValueError: Illegal chunk pattern: {(?<!<IN>)<NNP>+<CC><NN.*>+}
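One possible workaround, sketched here under my own assumptions, is to skip the chunk parser for this rule and run an ordinary regexp with a negative lookbehind over the tag sequence yourself. The helper name `coordinations_not_after_in` is hypothetical and not part of NLTK; the input is a list of (word, tag) pairs like the one `nltk.pos_tag` returns, typed in by hand here. The second lookbehind forces a match to begin at the start of a run of <NNP> tags, so the pattern cannot sidestep the <IN> check by starting in the middle of a multi-word name.

```python
import re

# <NNP>+<CC><NN.*>+ not preceded by <IN> (and not starting mid-run of <NNP>).
PATTERN = re.compile(r"(?<!<IN>)(?<!<NNP>)(?:<NNP>)+<CC>(?:<NN[^>]*>)+")

def coordinations_not_after_in(tagged):
    """Return word lists for NNP+ CC NN* sequences not preceded by IN."""
    # Encode the tags as one string, e.g. "<VBD><NNP><CC><NN><.>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    found = []
    for match in PATTERN.finditer(tag_string):
        # Each tag contributes exactly one "<", so counting them
        # converts character offsets back into token indices.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        found.append([word for word, _ in tagged[start:end]])
    return found

sentence1 = [("wrote", "VBD"), ("Coffee", "NNP"), ("&", "CC"),
             ("TV", "NN"), ("?", ".")]
sentence2 = [("set", "VBN"), ("in", "IN"), ("London", "NNP"),
             ("and", "CC"), ("Paris", "NNP"), ("?", ".")]

print(coordinations_not_after_in(sentence1))
print(coordinations_not_after_in(sentence2))
```

This returns plain word lists rather than an NLTK tree, so it only helps if you do not need the rest of the chunk structure.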
alexis
2

NLTK's tag chunking documentation is a bit confusing and not easy to find, so I struggled a lot to accomplish something similar.

Check following links:

Following @Luda's answer, I found an easy solution:

  1. Chunk what you want, using <IN>*<other tags>. This creates chunks that start with zero or more <IN>-tagged words.
  2. Chink <IN><other tags> from the previous chunk expression. This removes every chunk that starts with an <IN>-tagged word. (Note that the asterisk is gone.)

Example (taking @Ram G Athreya's question):

import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    # First rule chunks, second rule chinks the chunks that start with <IN>.
    grammar = r'''
        NP: {<IN>*<NNP>+<CC><NN.*>+}
            }<IN><NNP>+<CC><NN.*>+{
            '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)


(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  London/NNP
  and/CC
  Paris/NNP
  ?/.)

Now it chunks "Coffee & TV" but does not chunk "London and Paris".


Moreover, the same trick can be used to emulate lookbehind assertions. In an ordinary regexp a positive lookbehind is written (?<=...), but that syntax conflicts with the < and > symbols used in the chunk tag-pattern grammar.

So, in order to build a lookbehind, we could try the following:

  1. Chunk what you want, with <IN> at the beginning followed by the other tags you want. This creates chunks that start with one or more <IN>-tagged words.
  2. Chink the <IN> tag from the previous chunk expression. This removes all <IN>-tagged words from the chunks.

Example 2 - Chunk all words preceded by an <IN> tagged word:

import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Chunk <IN>+ plus the next token, then chink the <IN> words back out.
    grammar = r'''
        CHUNK: {<IN>+<.*>}
            }<IN>{
            '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  (CHUNK the/DT)
  band/NN
  that/WDT
  wrote/VBD
  Coffee/NNP
  &/CC
  TV/NN
  ?/.)
(S
  Who/WP
  of/IN
  (CHUNK those/DT)
  resting/VBG
  in/IN
  (CHUNK Westminster/NNP)
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (CHUNK London/NNP)
  and/CC
  Paris/NNP
  ?/.)

As we can see, it chunked "the" in sentence1, and "those", "Westminster" and "London" in sentence2.

0

See section 2.5, "Chinking", of chapter 7 of the NLTK book:

"We can define a chink to be a sequence of tokens that is not included in a chunk"

http://www.nltk.org/book/ch07.html

Use inverted curly braces, }...{, to mark what should be excluded:

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""
Luda
  • I don't know NLTK, but it sounds as though in the example in the question, `in London and Paris`, not only the `in` should be excluded from chunking, but also the otherwise-chunked `London and Paris`. However, if you can expand this answer to explain, I dunno, how to make the `in` trigger chinking for the following sequence, this would be useful. – Nathan Tuggy Jul 05 '17 at 01:12