search similar meaning phrases with nltk

Question

I have a bunch of unrelated paragraphs, and I need to traverse them to find occurrences with a similar meaning. For example, given a search for "object falls", I want a boolean True for text containing:

Box fell from shelf
Bulb shattered on the ground
A piece of plaster fell from the ceiling

And False for:

The blame fell on Sarah
The temperature fell abruptly

I am able to use nltk to tokenise, tag and get WordNet synsets, but I am finding it hard to figure out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking for synsets? Should I write a context-free grammar? Is there a best practice when translating from treebank tags to WordNet grammar tags? None of this is explained in the nltk book, and I haven't found it in the nltk cookbook yet.

Bonus points for answers that include pandas.

[ EDIT ]:

Some code to get things started

In [1]:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series

def tag(x):
    return pos_tag(word_tokenize(x))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly']

ser = Series(phrases)
ser.map(tag)

Out[1]:

0    [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
1    [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
2    [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
3    [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
4    [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
5    [(The, DT), (temperature, NN), (fell, VBD), (a...
dtype: object

A [similar question](http://stackoverflow.com/questions/11798389/what-nlp-tools-to-use-to-match-phrases-having-similar-meaning-or-sematics) was posted before, but I am hoping to attract answers with at least `pseudocode`. — dmvianna, Mar 17 '14 at 11:08

score 7 · Accepted Answer · answered Mar 31 '14 at 01:59

The way I would do it is the following:

Use nltk to find nouns followed by one or two verbs. To match your exact specifications I would use WordNet: the only nouns (NN, NNP, PRP, NNS) that should be matched are those in a semantic relation with "physical" or "material", and the only verbs (VB, VBZ, VBD, etc.) that should be matched are those in a semantic relation with "fall".

I mentioned "one or two verbs" because a verb can be preceded by an auxiliary. What you could also do is create a dependency tree to spot subject-verb relations, but it does not seem to be necessary in this case.
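The "noun followed by one or two verbs" check could be sketched roughly as below. This is only an illustration: it assumes the input is already a list of (word, tag) pairs in the shape returned by `pos_tag`, the function name `noun_verb_pairs` is made up for this example, and the sample sentences are tagged by hand rather than by the tagger.

```python
# Noun-like tags taken from the list above.
NOUN_TAGS = {"NN", "NNS", "NNP", "PRP"}

def noun_verb_pairs(tagged):
    """Return (noun, verb) word pairs where a noun is directly followed
    by one or two verb tokens; the last verb is treated as the main verb."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag not in NOUN_TAGS:
            continue
        # Collect up to two consecutive verb tokens (VB, VBD, VBZ, VBN, ...).
        verbs = []
        for next_word, next_tag in tagged[i + 1:i + 3]:
            if not next_tag.startswith("VB"):
                break
            verbs.append(next_word)
        if verbs:
            pairs.append((word, verbs[-1]))
    return pairs

# Hand-tagged examples (not actual tagger output).
print(noun_verb_pairs([("Box", "NNP"), ("fell", "VBD"), ("from", "IN"), ("shelf", "NN")]))
# [('Box', 'fell')]
print(noun_verb_pairs([("John", "NNP"), ("has", "VBZ"), ("fallen", "VBN")]))
# [('John', 'fallen')]
```

Taking the last verb of the run skips an auxiliary such as "has", so the word checked against WordNet is the main verb.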

You might also want to make sure you exclude location names and keep person names (because you would accept "John has fallen" but not "Berlin has fallen"). This can also be done with WordNet: locations have the tag 'noun.location'.
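A rough sketch of the location check, assuming NLTK 3, where each synset exposes its lexicographer file name through `lexname()`. The helper name `is_location` is hypothetical, and the lookup function is passed in as a parameter so the helper itself does not depend on the WordNet corpus being installed.

```python
def is_location(word, synsets):
    """Return True if any noun sense of `word` is filed under 'noun.location'.

    `synsets` is a callable taking (word, pos) and returning synset objects,
    for example nltk.corpus.wordnet.synsets.
    """
    return any(s.lexname() == "noun.location" for s in synsets(word, "n"))

# Possible usage (requires the WordNet corpus, e.g. nltk.download('wordnet')):
# from nltk.corpus import wordnet as wn
# is_location("Berlin", wn.synsets)
```

Note that this flags a word as soon as one of its senses is a location. A word that also has person or artifact senses would still be excluded, so whether to use `any` or a stricter rule is a judgment call.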

I am not sure in which context you would have to convert the tags, so I cannot provide a proper answer to that. It seems to me that you might not need it in this case: you use the POS tags to identify nouns and verbs, and then you check whether each noun and verb belongs to a synset.
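If a conversion does turn out to be needed (for instance, to restrict a synset lookup to one part of speech), a common convention is to map on the first letter of the treebank tag. The sketch below is an assumption about what such a mapping could look like, not an nltk function; the name `treebank_to_wordnet` is invented here, and the returned letters are WordNet's part-of-speech codes ('n', 'v', 'a', 'r').

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter,
    or None when WordNet has no matching category."""
    if tag.startswith("N"):
        return "n"  # NN, NNS, NNP, NNPS
    if tag.startswith("V"):
        return "v"  # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("J"):
        return "a"  # JJ, JJR, JJS
    if tag.startswith("R"):
        return "r"  # RB, RBR, RBS
    return None     # pronouns, determiners, prepositions, ...

print(treebank_to_wordnet("VBD"))  # v
print(treebank_to_wordnet("PRP"))  # None
```

Pronouns such as "it" map to None because WordNet only covers open-class words, so they would have to be handled separately.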

Hope this helps.

How would I use NLTK to find nouns followed by one or two verbs? Should I use the `RegexpParser` and a grammar? Your answer is helpful, but it would be more so with a code example. — dmvianna, Apr 01 '14 at 01:27
Get the result from ser.map(tag) and loop through each element. If the current element is an NN, NNP or PRP, check whether the next two are VB*. If the tags fit, take the words and check them using WordNet (I have no idea how to do that from Python). — Amandil, Apr 01 '14 at 18:12

score 0 · Answer 2 · edited May 23 '17 at 12:02

Not perfect, but most of the work is there. What remains is hardcoding pronouns (such as 'it') and closed-class words, and adding multiple targets to handle things like 'shattered'. Not a one-liner, but not an impossible task!

from collections.abc import Iterable

from nltk.corpus import wordnet as wn  # needs the WordNet corpus: nltk.download('wordnet')
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series, DataFrame

def tag(x):
    # Tokenise a phrase and attach a treebank POS tag to each token.
    return pos_tag(word_tokenize(x))

def flatten(l):
    # Recursively flatten nested iterables, leaving strings intact.
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, str):
            for sub in flatten(el):
                yield sub
        else:
            yield el

def noun_verb_match(phrase, nouns, verbs):
    res = []
    for i in range(len(phrase) - 1):
        if phrase[i][1] in nouns and phrase[i + 1][1] in verbs:
            res.append((phrase[i], phrase[i + 1]))
    return res

def hypernym_paths(word, pos):
    res = [x.hypernym_paths() for x in wn.synsets(word, pos)]
    return set(flatten(res))

def bool_syn(double, noun_syn, verb_syn):
    """
    Returns True if the noun and the verb of the pair each have the
    corresponding target WordNet synset among their hypernyms.
    Arguments:
    double: ((noun, tag), (verb, tag))
    noun_syn, verb_syn: WordNet synset name (e.g., 'travel.v.01')
    """
    noun = double[0][0]
    verb = double[1][0]
    noun_bool = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
    verb_bool = wn.synset(verb_syn) in hypernym_paths(verb, 'v')
    return noun_bool and verb_bool

def bool_loop(l, f):
    """
    Applies f to every list element and
    returns True if any result is True.
    Arguments:
    l: List.
    f: Function returning boolean.
    """
    return any(f(el) for el in l)

def bool_noun_verb(series, nouns, verbs, noun_synset_target, verb_synset_target):
    tagged = series.map(tag)
    nvm = lambda x: noun_verb_match(x, nouns, verbs)
    matches = tagged.apply(nvm)
    bs = lambda x: bool_syn(x, noun_synset_target, verb_synset_target)
    return matches.apply(lambda x: bool_loop(x, bs))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly',
           'It fell on the floor']

nouns = "NN NNP PRP NNS".split()
verbs = "VB VBD VBZ".split()
noun_synset_target = 'artifact.n.01'
verb_synset_target = 'travel.v.01'

df = DataFrame()
df['text'] = Series(phrases)
df['fall'] = bool_noun_verb(df.text, nouns, verbs, noun_synset_target, verb_synset_target)
df
