I would like to use spaCy's rule-based Matcher to mine "is a" (and other) relationships from Wikipedia in order to build a knowledge base.
I have the following code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
text = u"""Garfield is a large comic strip cat that lives in Ohio. Cape Town is the oldest city in South Africa."""
doc = nlp(text)
sentence_spans = list(doc.sents)
# Write a pattern
pattern = [
    {"POS": "PROPN", "OP": "+"},
    {"LEMMA": "be"},
    {"POS": "DET"},
    {"POS": "ADJ", "OP": "*"},
    {"POS": "NOUN", "OP": "+"},
]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IS_A_PATTERN", [pattern])
matches = matcher(doc)
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Unfortunately this matches:
Match found: Garfield is a large comic strip
Match found: Garfield is a large comic strip cat
Match found: Town is the oldest city
Match found: Cape Town is the oldest city
whereas I just want:
Match found: Garfield is a large comic strip cat
Match found: Cape Town is the oldest city
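I came across spacy.util.filter_spans, which (as I understand it) keeps only the longest of a set of overlapping spans, so something like this might prune the shorter matches. Here is a sketch; I build the spans by hand on a blank pipeline just so it runs without a trained model:

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")  # a blank pipeline is enough to construct spans
doc = nlp("Garfield is a large comic strip cat that lives in Ohio.")

# The two overlapping matches from above, as spans
overlapping = [doc[0:6], doc[0:7]]  # "... comic strip" vs "... comic strip cat"

# filter_spans keeps the longest non-overlapping spans
longest = filter_spans(overlapping)
print([s.text for s in longest])  # → ['Garfield is a large comic strip cat']
```

If that is right, running the matcher output through filter_spans before printing should leave only the two full matches I want.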
In addition, I would like to be able to require that the first part of the match be the subject of the sentence and the last part the predicate.
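For the subject constraint, I imagine the dependency parse could be checked, e.g. that the match's first token carries the nsubj relation. A sketch of that check (I hand-annotate the parse here so it runs without a model; normally the parser in en_core_web_lg would supply heads and deps):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse for "Garfield is a cat"
words = ["Garfield", "is", "a", "cat"]
heads = [1, 1, 3, 1]                       # absolute index of each token's head
deps  = ["nsubj", "ROOT", "det", "attr"]   # dependency labels
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

span = doc[0:4]  # a hypothetical matcher hit covering the whole clause
print(span[0].dep_ == "nsubj")  # is the first matched token the subject?
```

A real pipeline would attach these labels automatically, so the same `span[0].dep_ == "nsubj"` test could filter the matcher's results.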
I would also like each match returned split into parts, like this:
['Garfield', 'is a', 'large comic strip cat', 'comic strip cat']
['Cape Town', 'is the', 'oldest city', 'city']
So that I can get a list of cities.
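My thought is that the split could be done by grouping the matched tokens by part of speech, since the pattern itself is PROPN+ be DET ADJ* NOUN+. A sketch of such a helper (split_match is my own hypothetical name; I set the POS tags by hand so it runs without a model):

```python
import spacy
from spacy.tokens import Doc

def split_match(span):
    """Split a PROPN+ be DET ADJ* NOUN+ match into
    [subject, copula, predicate, head nouns] (my assumed grouping)."""
    subject = [t for t in span if t.pos_ == "PROPN"]
    copula  = [t for t in span if t.pos_ in ("AUX", "VERB", "DET")]
    adjs    = [t for t in span if t.pos_ == "ADJ"]
    nouns   = [t for t in span if t.pos_ == "NOUN"]
    join = lambda toks: " ".join(t.text for t in toks)
    return [join(subject), join(copula), join(adjs + nouns), join(nouns)]

# Hand-tagged tokens for the first match (a trained model would tag these)
nlp = spacy.blank("en")
words = ["Garfield", "is", "a", "large", "comic", "strip", "cat"]
pos   = ["PROPN", "AUX", "DET", "ADJ", "NOUN", "NOUN", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

parts = split_match(doc[0:7])
print(parts)  # → ['Garfield', 'is a', 'large comic strip cat', 'comic strip cat']
```

The last element would then give me the bare noun phrase ("city", "cat", ...) to build the list of cities from, if this approach holds up.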
Is any of this possible in spaCy, or what would the equivalent Python code be?