I am trying to construct a dependency matcher that catches certain phrases in a document and prints out the paragraphs containing those phrases. The phrases come from a long, pre-existing list of verb-noun combinations.
The wider purpose of this exercise is to pore over a large set of PDF documents and analyze what types of activities were undertaken, by whom, and with what frequency. The task is split into two parts. The first is to extract the paragraphs containing these phrases (verb-noun combinations and similar) so that humans can review a random sample and verify that the parsing works properly. The second is to use other characteristics associated with each PDF to analyze further the types of tasks being performed (draft/create/perform > document/task type "x"), by whom, when, and so on.
One example is "draft/prepare" > "procurement and market risk assessment".
I looked at the dependency tree of a sample sentence and then set up the DependencyMatcher to work with it. See the example below.
The sample sentence is "He drafted the procurement & market risk assessment". The dependency chain seems to be draft > assessment > procurement > risk > market.
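For reference, this is roughly how the tree can be inspected. It is a minimal sketch, not part of the matcher itself: the helper names `dependency_rows` and `print_parse` are hypothetical, and it assumes only the standard spaCy token attributes `text`, `dep_` and `head`.

```python
def dependency_rows(doc):
    """Return one (token, relation, head) triple per token.

    Accepts a spaCy Doc, or any iterable of objects that expose
    .text, .dep_ and .head the way spaCy tokens do.
    The root token is listed with itself as its head.
    """
    return [(tok.text, tok.dep_, tok.head.text) for tok in doc]


def print_parse(text, model="en_core_web_sm"):
    """Parse `text` with spaCy and print every dependency edge."""
    # Imported here so dependency_rows stays usable without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    for child, relation, head in dependency_rows(nlp(text)):
        print(f"{child:<12} <-{relation:<9} {head}")
```

Calling `print_parse("He drafted the procurement & market risk assessment.")` and then the same sentence with "and" instead of "&" makes it easy to compare which token is attached to which head in each variant.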
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import DependencyMatcher
dmatcher = DependencyMatcher(nlp.vocab)
doc = nlp("""He drafted the procurement & market risk assessment.""")
lemma_list_drpr = ['draft', 'prepare']
print("----- Using Dependency Matcher -----")
deppattern22 = [
    {"SPEC": {"NODE_NAME": "drpr"},
     "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
    {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
     "PATTERN": {"LEMMA": "assessment"}},
    {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
     "PATTERN": {"LEMMA": "procurement"}},
]
dmatcher.add("Pat22", patterns = [deppattern22])
for number, mylist in dmatcher(doc):
    for item in mylist:
        print(doc[item[0]].sent)
When I do this, it works.
However, this approach has several problems.
When I add the "risk" and "market" terms to the pattern, it no longer matches:
deppattern22a = [
    {"SPEC": {"NODE_NAME": "drpr"},
     "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
    {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
     "PATTERN": {"LEMMA": "assessment"}},
    {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
     "PATTERN": {"LEMMA": "procurement"}},
    {"SPEC": {"NBOR_NAME": "proc2", "NBOR_RELOP": ">", "NODE_NAME": "risk2"},
     "PATTERN": {"LEMMA": "risk"}},
    {"SPEC": {"NBOR_NAME": "risk2", "NBOR_RELOP": ">", "NODE_NAME": "mkt2"},
     "PATTERN": {"LEMMA": "market"}},
]
Moreover, when I change the sentence slightly by replacing "&" with "and", the dependency parse changes and my dependency matcher stops working again. The chain becomes draft > procurement > assessment > ..., whereas in the earlier sample sentence it was draft > assessment > procurement > ...
The parse changes back when I add other text to the sentence.
What would be a good way to find such matches that are not sensitive to minor changes in sentence structure?