
I am trying to construct a dependency matcher that catches certain phrases in a document and prints the paragraphs containing them. The phrases come from a long, pre-existing list of verb-noun combinations.

The wider purpose of this exercise is to pore through a large set of PDF documents and analyze what types of activities were undertaken, by whom, and with what frequency. The task is split into two parts. The first is to extract the paragraphs containing these phrases (verb-noun combinations and the like) so that humans can review a random sample and verify that the parsing works properly. The second is to use other characteristics associated with each PDF to analyze further which types of tasks (draft/create/perform > document or task type "x") were performed, by whom, when, and so on.

One example is "draft/prepare" > "procurement and market risk assessment".
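
The review step described above (collect the matching paragraphs, then hand a random sample to human reviewers) could be sketched roughly as follows. This is only an illustration: `matching_paragraphs`, `review_sample`, and `naive_contains` are hypothetical names, the substring test stands in for the real dependency matcher, and the three-paragraph text is made-up sample data.

```python
import random


def matching_paragraphs(text, contains_phrase):
    """Return the paragraphs of `text` for which `contains_phrase` is true."""
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if contains_phrase(p)]


def review_sample(paragraphs, k, seed=0):
    """Draw a reproducible random sample of at most `k` paragraphs."""
    rng = random.Random(seed)
    return rng.sample(paragraphs, min(k, len(paragraphs)))


def naive_contains(paragraph):
    """Hypothetical stand-in for the real matcher: a plain substring test."""
    return "drafted" in paragraph and "assessment" in paragraph


# Made-up sample text; only the first sentence comes from the question.
text = (
    "He drafted the procurement & market risk assessment.\n\n"
    "The meeting was postponed.\n\n"
    "She drafted the annual risk assessment."
)

hits = matching_paragraphs(text, naive_contains)
sample = review_sample(hits, k=1)
print(len(hits), "matching paragraphs; sampled for review:", sample)
```

Fixing the seed keeps the sample stable between runs, so reviewers see the same paragraphs if the extraction is rerun.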

I looked at the dependency tree of a sample sentence and then set up the DependencyMatcher to work with it, as in the example below.

The sample sentence is "He drafted the procurement & market risk assessment". The dependency chain seems to be draft > assessment > procurement > risk > market.


    import spacy
    from spacy.matcher import DependencyMatcher

    nlp = spacy.load('en_core_web_sm')
    dmatcher = DependencyMatcher(nlp.vocab)

    doc = nlp("He drafted the procurement & market risk assessment.")

    # Verb lemmas that may govern the noun phrase
    lemma_list_drpr = ['draft', 'prepare']

    print("----- Using Dependency Matcher -----")

    # verb (draft/prepare) > assessment > procurement
    deppattern22 = [
        {"SPEC": {"NODE_NAME": "drpr"},
         "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
        {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
         "PATTERN": {"LEMMA": "assessment"}},
        {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
         "PATTERN": {"LEMMA": "procurement"}},
    ]

    dmatcher.add("Pat22", patterns=[deppattern22])

    # Print the sentence that contains each match
    for number, mylist in dmatcher(doc):
        for item in mylist:
            print(doc[item[0]].sent)

When I do this, it works.


However, there are many problems here.

  1. When I add the "risk" and "market" terms to the pattern, it no longer matches:

    deppattern22a = [
        {"SPEC": {"NODE_NAME": "drpr"},
         "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
        {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
         "PATTERN": {"LEMMA": "assessment"}},
        {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
         "PATTERN": {"LEMMA": "procurement"}},
        {"SPEC": {"NBOR_NAME": "proc2", "NBOR_RELOP": ">", "NODE_NAME": "risk2"},
         "PATTERN": {"LEMMA": "risk"}},
        {"SPEC": {"NBOR_NAME": "risk2", "NBOR_RELOP": ">", "NODE_NAME": "mkt2"},
         "PATTERN": {"LEMMA": "market"}},
    ]
    
  2. Moreover, when I change the sentence slightly by replacing "&" with "and", the dependency tree changes and my matcher fails again. The dependency chain becomes draft > procurement > assessment > ..., whereas in the earlier sample sentence it was draft > assessment > procurement > ...


  3. The dependency changes back when I add other text to the sentence.
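
Since the "&" and "and" variants of the same sentence parse differently, one way to remove that source of variation is to normalize the text before parsing, so that only one surface form ever reaches the parser. The sketch below is a minimal illustration of that step; `normalize_ampersand` is a hypothetical helper name, and it only rewrites an "&" that stands alone between whitespace. It does not by itself make the parse stable, it just means a single pattern has to be written for one form.

```python
import re


def normalize_ampersand(text):
    """Replace a standalone '&' (surrounded by whitespace) with 'and'."""
    # Lookarounds leave tokens such as "R&D" untouched.
    return re.sub(r"(?<=\s)&(?=\s)", "and", text)


sentence = "He drafted the procurement & market risk assessment."
normalized = normalize_ampersand(sentence)
print(normalized)
```

The normalized string would then be passed to `nlp(...)` in place of the raw text.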


What would be a good way to find such matches that are not sensitive to minor changes in sentence structure?
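
One possible direction, sketched here without spaCy so the idea is visible on its own: instead of requiring each noun to be a direct child of the previous one, only require that each noun sits somewhere below the verb in the tree. The two child-to-head maps below encode the chains described above (the second one only as far as it was shown); `has_ancestor` and `verb_governs` are hypothetical helper names, and keying the maps by word is a simplification that assumes each word occurs once. spaCy's DependencyMatcher documents a `>>` relation (head in a chain of dependencies) that expresses the same idea, but whether it resolves the cases here has not been tested.

```python
def has_ancestor(heads, node, ancestor):
    """Return True if `ancestor` is reachable from `node` via head links."""
    seen = set()
    current = node
    while current in heads and current not in seen:
        seen.add(current)  # guard against accidental cycles
        current = heads[current]
        if current == ancestor:
            return True
    return False


def verb_governs(heads, verb, nouns):
    """Return True if every noun is a (possibly indirect) dependent of `verb`."""
    return all(has_ancestor(heads, noun, verb) for noun in nouns)


# child -> head, for the "&" sentence: draft > assessment > procurement > risk > market
parse_ampersand = {
    "assessment": "draft",
    "procurement": "assessment",
    "risk": "procurement",
    "market": "risk",
}

# child -> head, for the "and" sentence: draft > procurement > assessment > ...
# (the rest of that chain is not given, so it is left out here)
parse_and = {
    "procurement": "draft",
    "assessment": "procurement",
}

results = [
    verb_governs(parse, "draft", ["assessment", "procurement"])
    for parse in (parse_ampersand, parse_and)
]
print(results)
```

Both parses pass the check even though "assessment" and "procurement" swap places, because the test ignores the order of the nodes between the verb and each noun.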

Amatya
  • "&" should be preprocessed to "and" before the text is sent to the dependency parser, to get the correct tree. The dependency tree always changes when the sentence is changed. Could you please elaborate on the task you are trying to achieve? If you are looking for intent detection, use the spaCy module. Check this [one](https://stackoverflow.com/questions/60083593/how-to-retrieve-the-main-intent-of-a-sentence-using-spacy-or-nltk) for more info – gowridev May 31 '21 at 03:36
  • @gowridev let me update the question to describe the underlying purpose of the exercise. Thx – Amatya May 31 '21 at 03:56
  • I think it would help to search for papers on handling conjunctions in compound nouns. These links may be helpful: [link1](https://stackoverflow.com/questions/56110998/extract-compounds-and-dobj-from-dependency-tree-using-spacy), [link2](https://spacy.io/universe/project/self-attentive-parser), [link3](https://stackoverflow.com/questions/48925328/how-to-get-all-noun-phrases-in-spacy) and [link4](https://stackoverflow.com/questions/51308482/wish-to-extract-compound-noun-adjective-pairs-from-a-sentence-so-basically-i-w) – gowridev May 31 '21 at 06:32
