Is there a way to get the distance of a Noun from the Verb from multiple sentences in a csv file using NLTK and Python?

Example of sentences in a .csv file:

video shows adam stabbing the bystander.
woman quickly ran from the police after the incident.

Output:

1st sentence: 1 (Verb is right after the noun)

2nd sentence: 2 (Verb is after another POS tag)

Beginner
  • That sounds very similar to [Extract nouns and verbs using nltk?](https://stackoverflow.com/questions/50151820/extract-nouns-and-verbs-using-nltk) – Stef Mar 23 '22 at 10:17

1 Answer

Distance between first verb and previous noun

Inspired by the very similar question Extract nouns and verbs using nltk?.

import nltk

def dist_noun_verb(text):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    last_noun_pos = None
    for pos, (word, function) in enumerate(pos_tagged):
        if function.startswith('NN'):
            last_noun_pos = pos
        elif function.startswith('VB'):
            assert(last_noun_pos is not None)
            return pos - last_noun_pos

for sentence in ['Video show Adam stabbing the bystander.', 'Woman quickly ran from the police after the incident.']:
    print(sentence)
    d = dist_noun_verb(sentence)
    print('Distance noun-verb: ', d)

Output:

Video show Adam stabbing the bystander.
Distance noun-verb:  1
Woman quickly ran from the police after the incident.
Distance noun-verb:  2

Note that function.startswith('VB') detects the first verb in the sentence, whatever its form. If you want to distinguish the main verb from other kinds of verbs, you need to examine the specific verb tags returned by nltk.pos_tag: 'VBP', 'VBD', etc.
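As a rough sketch of that distinction, the helper below only accepts a chosen set of Penn Treebank verb tags. The tagged tokens are written by hand for illustration (they are not actual nltk.pos_tag output), and first_verb_index and FINITE_VERB_TAGS are names invented here. Treating VBD, VBP and VBZ as "finite" is a simplification.

```python
# Penn Treebank verb tags: VB (base form), VBD (past tense), VBG (gerund or
# present participle), VBN (past participle), VBP (present, not 3rd person
# singular), VBZ (present, 3rd person singular).
FINITE_VERB_TAGS = {'VBD', 'VBP', 'VBZ'}  # hypothetical grouping

def first_verb_index(pos_tagged, verb_tags):
    """Return the index of the first token whose tag is in verb_tags, or None."""
    for index, (word, tag) in enumerate(pos_tagged):
        if tag in verb_tags:
            return index
    return None

# Hand-written tags for illustration; real nltk.pos_tag output may differ.
tagged = [('The', 'DT'), ('man', 'NN'), ('stabbing', 'VBG'), ('the', 'DT'),
          ('bystander', 'NN'), ('ran', 'VBD'), ('.', '.')]

print(first_verb_index(tagged, {'VBG'}))           # 2
print(first_verb_index(tagged, FINITE_VERB_TAGS))  # 5
```

With real text, the list would come from nltk.pos_tag(nltk.word_tokenize(sentence)) instead of being typed in.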

Also, the assert(last_noun_pos is not None) line in my code means the code will crash if the first verb comes before any noun. You might want to handle that differently.
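One possible way to handle that case is to return None instead of asserting. The sketch below is a variant of the function above that takes already-tagged tokens, so it runs without downloading any NLTK data. The name dist_noun_verb_tagged and the hand-written tags are assumptions made for illustration.

```python
def dist_noun_verb_tagged(pos_tagged):
    """Distance from the first verb back to the closest preceding noun.

    Returns None when there is no verb, or no noun before the first verb.
    """
    last_noun_pos = None
    for pos, (word, tag) in enumerate(pos_tagged):
        if tag.startswith('NN'):
            last_noun_pos = pos
        elif tag.startswith('VB'):
            if last_noun_pos is None:
                return None  # the first verb comes before any noun
            return pos - last_noun_pos
    return None  # no verb found at all

# Hand-written tags for illustration; real nltk.pos_tag output may differ.
noun_first = [('Woman', 'NN'), ('quickly', 'RB'), ('ran', 'VBD')]
verb_first = [('Run', 'VB'), ('to', 'TO'), ('the', 'DT'), ('police', 'NN')]

print(dist_noun_verb_tagged(noun_first))  # 2
print(dist_noun_verb_tagged(verb_first))  # None
```

Returning None keeps "no answer" distinct from any real distance; a sentinel such as -1 would work too if the caller prefers a number.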

Interestingly, if I add an 's' to 'show' and make the sentence 'Video shows Adam stabbing the bystander.', then nltk parses 'shows' as a noun rather than a verb.
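To run this over a .csv file, as asked in the question, something like the following sketch could work. It assumes one sentence per row in the first column and no header row (the question does not say how the file is laid out). distances_from_csv and count_tokens are hypothetical names, and an in-memory io.StringIO stands in for the real file.

```python
import csv
import io

def distances_from_csv(csv_file, distance_fn):
    """Apply distance_fn to the first column of every non-empty row."""
    results = []
    for row in csv.reader(csv_file):
        if not row or not row[0].strip():
            continue  # skip blank lines
        sentence = row[0].strip()
        results.append((sentence, distance_fn(sentence)))
    return results

# Stand-in for an opened file such as open('sentences.csv', newline='');
# that file name is only an example.
sample = io.StringIO(
    "video shows adam stabbing the bystander.\n"
    "woman quickly ran from the police after the incident.\n"
)

# Placeholder callable so the sketch runs without NLTK data;
# replace it with dist_noun_verb in real use.
def count_tokens(sentence):
    return len(sentence.split())

for sentence, d in distances_from_csv(sample, count_tokens):
    print(sentence, '->', d)
```

Note that an unquoted sentence containing commas would be split across several columns by csv.reader, so such sentences need to be quoted in the file.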

Going further: distance between "main" verb and previous noun

Consider the sentence:

'The umbrella that I used to protect myself from the rain was red.'

This sentence contains three verbs: 'used', 'protect', 'was'. Using nltk.word_tokenize followed by nltk.pos_tag, as I did above, correctly identifies those three verbs:

text = 'The umbrella that I used to protect myself from the rain was red.'
tokens = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)
# [('The', 'DT'), ('umbrella', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('used', 'VBD'), ('to', 'TO'), ('protect', 'VB'), ('myself', 'PRP'), ('from', 'IN'), ('the', 'DT'), ('rain', 'NN'), ('was', 'VBD'), ('red', 'JJ'), ('.', '.')]
print([(w,f) for w,f in pos_tagged if f.startswith('VB')])
# [('used', 'VBD'), ('protect', 'VB'), ('was', 'VBD')]

However, the main verb of the sentence is 'was'; the other two verbs are part of the nominal group that forms the subject of the sentence, 'The umbrella that I used to protect myself from the rain'.

Thus we might like to write a function dist_subject_verb that returns the distance between the subject and the main verb 'was', rather than between the first verb 'used' and the previous noun.

One way to identify the main verb is to parse the sentence into a tree, and ignore verbs that are located in subtrees, only considering the verb that is a direct child of the root.

The sentence should be parsed as something like:

((The umbrella) (that (I used) to (protect (myself) (from (the rain))))) (was) (red)

And now we can easily ignore 'used' and 'protect', which are deep inside subtrees, and only consider the main verb 'was'.
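To illustrate the idea only, here is a minimal sketch that encodes the bracketing above by hand as nested Python lists, with (word, tag) tuples as leaves. This is not the output of a real parser: the structure, the tags, and the function names direct_child_verbs and all_verbs are assumptions, and the single-word groups (was) and (red) are written as plain leaves.

```python
def direct_child_verbs(tree):
    """Verbs that are leaves directly under the root, not inside a subtree."""
    return [child[0] for child in tree
            if isinstance(child, tuple) and child[1].startswith('VB')]

def all_verbs(tree):
    """Every verb anywhere in the tree, found by recursive descent."""
    verbs = []
    for child in tree:
        if isinstance(child, tuple):      # leaf: (word, tag)
            if child[1].startswith('VB'):
                verbs.append(child[0])
        else:                             # subtree: nested list
            verbs.extend(all_verbs(child))
    return verbs

# Hand-built tree mirroring the bracketing above (lists are subtrees).
tree = [
    [   # subject group
        [('The', 'DT'), ('umbrella', 'NN')],
        [('that', 'IN'),
         [('I', 'PRP'), ('used', 'VBD')],
         ('to', 'TO'),
         [('protect', 'VB'),
          [('myself', 'PRP')],
          [('from', 'IN'), [('the', 'DT'), ('rain', 'NN')]]]],
    ],
    ('was', 'VBD'),
    ('red', 'JJ'),
]

print(all_verbs(tree))           # ['used', 'protect', 'was']
print(direct_child_verbs(tree))  # ['was']
```

A real parser would likely produce a differently shaped tree (for example with the main verb inside a verb phrase node), so the "direct child of the root" test would need adapting to that structure.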

Parsing the sentence into a tree is a much more complex operation than just tokenizing it.

Here is a similar question that deals with parsing a sentence into a tree: [How to get parse tree using python nltk?](https://stackoverflow.com/a/71492485/3080723)

Stef
  • Thank you for your answer. However, I seem to be getting an AssertionError: I've got about 80 sentences in my .csv file and the code crashes after around 10-12 sentences. Also, if I change the sentence in your code to Video show Adam yellow funny quietly stabbing the bystander (adding random adjectives and adverbs), the distance returned is still 1. Any idea how to fix that? I'm new to this, my apologies. – Beginner Mar 23 '22 at 11:25
  • @Beginner: As I said, you might want to handle the case where the first verb comes before any noun differently. Maybe return -1 instead of raising an assertion error. Replace `assert(last_noun_pos is not None)` with `if last_noun_pos is None: return whatever_you_want_to_return` – Stef Mar 23 '22 at 11:29
  • On the sentence `'Video show Adam yellow funny quietly stabbing the bystander'`, the distance returned is still 1, because the sentence starts with `'Video' (noun)` followed by `'show' (verb)`, so why would you want to return anything other than 1? – Stef Mar 23 '22 at 11:31
  • If the sentence contains two verbs, such as "show" and then "stabbing", the function I wrote only cares about the first verb. But note that nltk.pos_tag actually differentiates different forms of verbs. You can see a list by typing `nltk.help.upenn_tagset('VB.*')` in a python interpreter. – Stef Mar 23 '22 at 11:33
  • @Beginner Finding out which verb is "the main verb" of the sentence is harder than just finding all verbs. You need to actually parse the sentence rather than just tokenize it. See for instance this question: [How to get parse tree using python nltk?](https://stackoverflow.com/a/71492485/3080723). It builds a tree from the sentence; the "main" verb should be a direct child of the root, and you can ignore all verbs that are in subtrees. – Stef Mar 23 '22 at 11:38