I have a bunch of unrelated paragraphs, and I need to traverse them to find similar occurrences: given a search for "object falls", I want a boolean True for text containing:
- Box fell from shelf
- Bulb shattered on the ground
- A piece of plaster fell from the ceiling
And False for:
- The blame fell on Sarah
- The temperature fell abruptly
I am able to use nltk to tokenise, tag and get WordNet synsets, but I am finding it hard to figure out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking for synsets? Should I write a context-free grammar? Is there a best practice when translating from treebank tags to WordNet grammar tags? None of this is explained in the nltk book, and I couldn't find it in the nltk cookbook yet.
Bonus points for answers that include pandas.
[ EDIT ]:
Some code to get things started
In [1]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
def tag(x):
    # Tokenise a phrase and attach a Penn Treebank POS tag to each token
    return pos_tag(word_tokenize(x))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly']
ser = Series(phrases)
ser.map(tag)
Out[1]:
0 [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
1 [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
2 [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
3 [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
4 [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
5 [(The, DT), (temperature, NN), (fell, VBD), (a...
dtype: object
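For the tag translation, the most naive thing I can think of is a prefix lookup like the sketch below. The function name treebank_to_wordnet and the prefix table are my own assumptions, not something nltk provides, and I don't know whether this counts as best practice. It only covers the four parts of speech WordNet knows about and returns None for everything else.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match.

    Hypothetical helper: the prefix table is an assumed convention.
    The letters are the values of nltk.corpus.wordnet's ADJ, VERB, NOUN
    and ADV constants ('a', 'v', 'n', 'r').
    """
    for prefix, pos in (('JJ', 'a'), ('VB', 'v'), ('NN', 'n'), ('RB', 'r')):
        if tag.startswith(prefix):
            return pos
    return None


# Tags copied from the visible part of row 0 in the output above
# (the rest of that row is truncated, so it is left out here).
tagged = [('Box', 'NNP'), ('fell', 'VBD'), ('from', 'IN')]
converted = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(converted)  # [('Box', 'n'), ('fell', 'v'), ('from', None)]
```

The resulting letter could then be passed as the pos argument of wordnet.synsets to narrow the candidate senses, but that still doesn't tell me which sense of "fell" is the physical one.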