I assume you have something similar to wikidata knowledge base that is a giant list of concepts with aliases.
More or less this can be represented as follow:
C1 new york
C1 nyc
C1 big apple
Now the link a spans of a sentence to the above KB, for single words it is easy, you just have to setup a index that maps a single word concept to an identifier.
The difficult part is linking multiple word concepts or phrasal concepts like "new york" or "big apple".
To achieve that I use an algorithm that splits a sentence into all the slices possible. I call those "spans". Then try to match individual span or group of words with a concept from the database (single word or with multiple words).
For instance, here is example of all the spans for a simple sentence. It is a list that store lists of strings:
[['new'], ['york'], ['is'], ['the'], ['big'], ['apple']]
[['new'], ['york'], ['is'], ['the'], ['big', 'apple']]
[['new'], ['york'], ['is'], ['the', 'big'], ['apple']]
[['new'], ['york'], ['is'], ['the', 'big', 'apple']]
[['new'], ['york'], ['is', 'the'], ['big'], ['apple']]
[['new'], ['york'], ['is', 'the'], ['big', 'apple']]
[['new'], ['york'], ['is', 'the', 'big'], ['apple']]
[['new'], ['york'], ['is', 'the', 'big', 'apple']]
[['new'], ['york', 'is'], ['the'], ['big'], ['apple']]
[['new'], ['york', 'is'], ['the'], ['big', 'apple']]
[['new'], ['york', 'is'], ['the', 'big'], ['apple']]
[['new'], ['york', 'is'], ['the', 'big', 'apple']]
[['new'], ['york', 'is', 'the'], ['big'], ['apple']]
[['new'], ['york', 'is', 'the'], ['big', 'apple']]
[['new'], ['york', 'is', 'the', 'big'], ['apple']]
[['new'], ['york', 'is', 'the', 'big', 'apple']]
[['new', 'york'], ['is'], ['the'], ['big'], ['apple']]
[['new', 'york'], ['is'], ['the'], ['big', 'apple']]
[['new', 'york'], ['is'], ['the', 'big'], ['apple']]
[['new', 'york'], ['is'], ['the', 'big', 'apple']]
[['new', 'york'], ['is', 'the'], ['big'], ['apple']]
[['new', 'york'], ['is', 'the'], ['big', 'apple']]
[['new', 'york'], ['is', 'the', 'big'], ['apple']]
[['new', 'york'], ['is', 'the', 'big', 'apple']]
[['new', 'york', 'is'], ['the'], ['big'], ['apple']]
[['new', 'york', 'is'], ['the'], ['big', 'apple']]
[['new', 'york', 'is'], ['the', 'big'], ['apple']]
[['new', 'york', 'is'], ['the', 'big', 'apple']]
[['new', 'york', 'is', 'the'], ['big'], ['apple']]
[['new', 'york', 'is', 'the'], ['big', 'apple']]
[['new', 'york', 'is', 'the', 'big'], ['apple']]
[['new', 'york', 'is', 'the', 'big', 'apple']]
Each sublist may or may not map to a concept. To find the best mapping, you can score each of the above line based on the number of concept that match.
Here is two of the above list of spans that have the best score according to the example knowledge base:
2 ~ [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
2 ~ [['new', 'york'], ['is', 'the'], ['big', 'apple']]
So it guessed "new york" is concept and "big apple" is also a concept.
Here is the full code:
input = 'new york is the big apple'.split()
def spans(lst):
if len(lst) == 0:
yield None
for index in range(1, len(lst)):
for span in spans(lst[index:]):
if span is not None:
yield [lst[0:index]] + span
yield [lst]
knowledgebase = [
['new', 'york'],
['big', 'apple'],
]
out = []
scores = []
for span in spans(input):
score = 0
for candidate in span:
for uid, entity in enumerate(knowledgebase):
if candidate == entity:
score += 1
out.append(span)
scores.append(score)
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])
for winner in leaderboard:
print(winner[1], ' ~ ', winner[0])
This can must be improved to associate list that match a concept to its concept identifier, and find a way to spell check everything (according to the knowledge base).