Spacy Dependency Matcher problematic and sensitive for long verb-noun phrases

Question

I am trying to construct a dependency matcher that catches certain phrases in documents and prints out the paragraphs containing those phrases. The phrases come from a long, pre-existing list of verb-noun combinations.

The wider purpose of this exercise is to go through a large set of PDF documents and analyze what types of activities were undertaken, by whom, and with what frequency. The task is split into two parts. The first is to extract the paragraphs containing these phrases (verb-noun combinations and the like) so that humans can review a random sample and verify that the parsing is working properly. The second is to use other characteristics associated with each PDF to analyze further the types of tasks being performed (draft/create/perform > document or task type "x"), by whom, when, and so on.

One example is "draft/prepare" > "procurement and market risk assessment".

I looked at the dependency tree of a sample sentence and then set up the DependencyMatcher to work with it. Please see the example below.

The sample sentence is "He drafted the procurement & market risk assessment". The dependency chain seems to be draft > assessment > procurement > risk > market.
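One way to check the arcs the parser actually produced, instead of reading them off a rendered tree, is to list every token with its dependency label and head. The following is a minimal sketch: the helper name `dependency_edges` is hypothetical, and it relies only on the `text`, `dep_`, and `head` attributes that spaCy tokens expose. The demo tokens are hand-built stand-ins so the sketch runs without a model; their arcs are illustrative and are not real parser output.

```python
from types import SimpleNamespace


def dependency_edges(tokens):
    """Return one 'head --dep--> child' string per token.

    Accepts any iterable of objects exposing .text, .dep_ and .head,
    which includes the tokens of a parsed spaCy Doc.
    """
    edges = []
    for token in tokens:
        if token.dep_ == "ROOT":
            # spaCy's English models label the sentence root "ROOT"
            edges.append(f"ROOT: {token.text}")
        else:
            edges.append(f"{token.head.text} --{token.dep_}--> {token.text}")
    return edges


# Hand-built stand-ins for tokens (assumption: illustrative arcs only,
# NOT what en_core_web_sm actually returns for the sample sentence).
root = SimpleNamespace(text="drafted", dep_="ROOT")
root.head = root
obj = SimpleNamespace(text="assessment", dep_="dobj", head=root)

demo_edges = dependency_edges([root, obj])
print("\n".join(demo_edges))
```

With a parsed `Doc` such as the one built in the example below, calling `dependency_edges(doc)` would show which token is really the head of "risk" and "market", and running it on both the "&" and the "and" variants of the sentence would make the difference between the two parses explicit.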

    import spacy
    from spacy.matcher import DependencyMatcher

    nlp = spacy.load('en_core_web_sm')
    dmatcher = DependencyMatcher(nlp.vocab)

    doc = nlp("""He drafted the procurement & market risk assessment.""")

    # Verb lemmas that may head the phrase
    lemma_list_drpr = ['draft', 'prepare']

    print("----- Using Dependency Matcher -----")

    deppattern22 = [
        # Anchor node: the verb (draft / prepare)
        {"SPEC": {"NODE_NAME": "drpr"}, "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
        # "assessment" must be an immediate child of the verb
        {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
         "PATTERN": {"LEMMA": "assessment"}},
        # "procurement" must be an immediate child of "assessment"
        {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
         "PATTERN": {"LEMMA": "procurement"}},
    ]

    dmatcher.add("Pat22", patterns=[deppattern22])

    # Print the sentence containing each match
    for number, mylist in dmatcher(doc):
        for item in mylist:
            print(doc[item[0]].sent)

When I do this, it works.

However, there are many problems here.

When I try to add the "risk" and "market" terms to the matcher, it no longer works:

    deppattern22a = [
        {"SPEC": {"NODE_NAME": "drpr"}, "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
        {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
         "PATTERN": {"LEMMA": "assessment"}},
        {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
         "PATTERN": {"LEMMA": "procurement"}},
        {"SPEC": {"NBOR_NAME": "proc2", "NBOR_RELOP": ">", "NODE_NAME": "risk2"},
         "PATTERN": {"LEMMA": "risk"}},
        {"SPEC": {"NBOR_NAME": "risk2", "NBOR_RELOP": ">", "NODE_NAME": "mkt2"},
         "PATTERN": {"LEMMA": "market"}},
    ]

Moreover, when I change the sentence slightly by replacing "&" with "and", the dependency tree changes and my matcher stops working again. The chain becomes draft > procurement > assessment > ..., whereas in the earlier sample sentence it was draft > assessment > procurement > ...

The dependency tree changes back when I add other text to the sentence.

What would be a good way to find such matches that are not sensitive to minor changes in sentence structure?

The "&" should be preprocessed to "and" before the text is sent to the dependency parser, so that you get the correct tree. The dependency tree will always change when the sentence is modified. Could you please elaborate on the task you are trying to achieve? If you are looking for intent detection, use the spaCy module; check this [one](https://stackoverflow.com/questions/60083593/how-to-retrieve-the-main-intent-of-a-sentence-using-spacy-or-nltk) for more info. - gowridev, May 31 '21 at 03:36
@gowridev Let me update the question to describe the underlying purpose of the exercise. Thanks. - Amatya, May 31 '21 at 03:56
I think you should search for papers on handling conjunctions in compound nouns. These links may be helpful: [link1](https://stackoverflow.com/questions/56110998/extract-compounds-and-dobj-from-dependency-tree-using-spacy), [link2](https://spacy.io/universe/project/self-attentive-parser), [link3](https://stackoverflow.com/questions/48925328/how-to-get-all-noun-phrases-in-spacy) and [link4](https://stackoverflow.com/questions/51308482/wish-to-extract-compound-noun-adjective-pairs-from-a-sentence-so-basically-i-w). - gowridev, May 31 '21 at 06:32
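
The suggestion in the comments above, normalizing "&" to "and" before parsing, could be sketched roughly as follows. The function name `normalize_ampersand` is hypothetical, and the rule it applies (only a "&" with whitespace on both sides is replaced) is an assumption chosen so that tokens such as "AT&T" are left alone. Note that the question reports a different parse once "&" becomes "and", so any matcher patterns would need to be written against the normalized text.

```python
import re

# A "&" that has whitespace on both sides (assumption: an ampersand
# embedded in a name such as "AT&T" should stay untouched).
_STANDALONE_AMPERSAND = re.compile(r"(?<=\s)&(?=\s)")


def normalize_ampersand(text):
    """Replace each standalone '&' with the word 'and'."""
    return _STANDALONE_AMPERSAND.sub("and", text)


sample = "He drafted the procurement & market risk assessment."
normalized = normalize_ampersand(sample)
print(normalized)
```

The normalized string would then be passed to `nlp(...)` in place of the raw text, so that every document is parsed with the same conjunction form.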
