How to perform entity linking to local knowledge graph?

Question

I'm building my own knowledge base from scratch, using articles online.

I am trying to map the entities from my scraped SPO triples (the Subject and potentially the Object) to my own list of entities, which consists of listed companies that I scraped from another website.

I've researched most of the libraries, and their methods focus on mapping entities to big knowledge bases like Wikipedia, YAGO, etc., but I'm not really sure how to apply those techniques to my own knowledge base.

So far, I've found the NEL Python package, which claims to be able to do so, but I don't quite understand the documentation, and it focuses only on a Wikipedia data dump.

Are there any techniques or libraries that allow me to do so?

I'm building a knowledge graph that stores information about listed companies. These are often not available in Wikipedia, unless they are very large companies. — Alex Ramses, May 25 '19 at 20:42
I can not find the NIL Python package, can you provide a link please? — amirouche, May 25 '19 at 23:22
I will put together something that solves your problem. That said, the big issue here is that you want to extract triples from text; that is the difficult part. The rest (database, crawling / scraping) can be considered boilerplate. — amirouche, May 25 '19 at 23:23
I believe I had made a typo on my post, the package should be NEL, which can be found here: https://nel.readthedocs.io/en/latest/ — Alex Ramses, Jun 02 '19 at 16:00

score 0 · Answer 1 · answered Sep 30 '19 at 11:06

I assume you have something similar to the Wikidata knowledge base, that is, a giant list of concepts with aliases.

More or less, this can be represented as follows:

C1 new york
C1 nyc
C1 big apple

Now, to link spans of a sentence to the above KB: for single words it is easy, you just have to set up an index that maps a single-word concept to an identifier.
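As a rough illustration of such an index, here is a minimal sketch using a plain Python dict. The identifiers (`C1`, `C2`), the `ALIASES` table and the `lookup` helper are made up for this example; a real knowledge base would load its aliases from its own store. Keys are tuples of lowercase words, so the same index can also hold multi-word aliases.

```python
# Hypothetical alias table: concept identifier -> list of aliases.
ALIASES = {
    "C1": ["new york", "nyc", "big apple"],
    "C2": ["paris"],
}

# Inverted index: alias (as a tuple of lowercase words) -> concept identifier.
index = {}
for concept_id, aliases in ALIASES.items():
    for alias in aliases:
        index[tuple(alias.lower().split())] = concept_id


def lookup(words):
    """Return the concept identifier for a list of words, or None."""
    return index.get(tuple(word.lower() for word in words))


print(lookup(["nyc"]))           # single-word alias
print(lookup(["Big", "Apple"]))  # multi-word alias, case-insensitive
print(lookup(["london"]))        # not in the knowledge base
```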

The difficult part is linking multiple word concepts or phrasal concepts like "new york" or "big apple".

To achieve that, I use an algorithm that splits a sentence into all possible slices. I call those "spans". Then I try to match each span (a single word or a group of words) with a concept from the database.

For instance, here are all the ways to split a simple sentence into spans. Each line is a list of lists of strings:

[['new'], ['york'], ['is'], ['the'], ['big'], ['apple']]
[['new'], ['york'], ['is'], ['the'], ['big', 'apple']]
[['new'], ['york'], ['is'], ['the', 'big'], ['apple']]
[['new'], ['york'], ['is'], ['the', 'big', 'apple']]
[['new'], ['york'], ['is', 'the'], ['big'], ['apple']]
[['new'], ['york'], ['is', 'the'], ['big', 'apple']]
[['new'], ['york'], ['is', 'the', 'big'], ['apple']]
[['new'], ['york'], ['is', 'the', 'big', 'apple']]
[['new'], ['york', 'is'], ['the'], ['big'], ['apple']]
[['new'], ['york', 'is'], ['the'], ['big', 'apple']]
[['new'], ['york', 'is'], ['the', 'big'], ['apple']]
[['new'], ['york', 'is'], ['the', 'big', 'apple']]
[['new'], ['york', 'is', 'the'], ['big'], ['apple']]
[['new'], ['york', 'is', 'the'], ['big', 'apple']]
[['new'], ['york', 'is', 'the', 'big'], ['apple']]
[['new'], ['york', 'is', 'the', 'big', 'apple']]
[['new', 'york'], ['is'], ['the'], ['big'], ['apple']]
[['new', 'york'], ['is'], ['the'], ['big', 'apple']]
[['new', 'york'], ['is'], ['the', 'big'], ['apple']]
[['new', 'york'], ['is'], ['the', 'big', 'apple']]
[['new', 'york'], ['is', 'the'], ['big'], ['apple']]
[['new', 'york'], ['is', 'the'], ['big', 'apple']]
[['new', 'york'], ['is', 'the', 'big'], ['apple']]
[['new', 'york'], ['is', 'the', 'big', 'apple']]
[['new', 'york', 'is'], ['the'], ['big'], ['apple']]
[['new', 'york', 'is'], ['the'], ['big', 'apple']]
[['new', 'york', 'is'], ['the', 'big'], ['apple']]
[['new', 'york', 'is'], ['the', 'big', 'apple']]
[['new', 'york', 'is', 'the'], ['big'], ['apple']]
[['new', 'york', 'is', 'the'], ['big', 'apple']]
[['new', 'york', 'is', 'the', 'big'], ['apple']]
[['new', 'york', 'is', 'the', 'big', 'apple']]
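The listing grows quickly with sentence length: a sentence of n words has n - 1 gaps between words, and each gap is either a cut or not, which gives 2^(n-1) lines (32 for the six words above). The sketch below is only a cross-check of that count; it enumerates the same splits from the cut positions, and `segmentations` is a name chosen for this example.

```python
from itertools import product


def segmentations(words):
    """Yield every split of words into contiguous, non-empty groups."""
    if not words:
        return
    # Each of the len(words) - 1 gaps is either a cut (True) or not (False).
    for cuts in product([False, True], repeat=len(words) - 1):
        groups, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                groups.append(current)
                current = [word]
            else:
                current.append(word)
        groups.append(current)
        yield groups


words = "new york is the big apple".split()
all_splits = list(segmentations(words))
print(len(all_splits))  # equals 2 ** (len(words) - 1)
```

Because the count doubles with every extra word, long sentences would probably need a cap on span length or a different search strategy.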

Each sublist may or may not map to a concept. To find the best mapping, you can score each of the above lines by the number of sublists that match a concept.

Here are the two lists of spans from above that have the best score according to the example knowledge base:

2  ~  [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
2  ~  [['new', 'york'], ['is', 'the'], ['big', 'apple']]

So it guessed that "new york" is a concept and that "big apple" is also a concept.

Here is the full code:

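# Tokenise the sentence on whitespace.
input = 'new york is the big apple'.split()


def spans(lst):
    """Yield every way to split lst into contiguous, non-empty groups."""
    if not lst:
        return
    for index in range(1, len(lst)):
        # Take the first `index` words as one group, then split the rest.
        for rest in spans(lst[index:]):
            yield [lst[:index]] + rest
    # Finally, the whole list as a single group.
    yield [lst]

# Toy knowledge base: each entity is a list of words.
knowledgebase = [
    ['new', 'york'],
    ['big', 'apple'],
]

out = []
scores = []

for span in spans(input):
    # Score = number of groups that exactly match a knowledge base entity.
    score = sum(1 for candidate in span if candidate in knowledgebase)
    out.append(span)
    scores.append(score)

# Ascending sort: the best-scoring splits are printed last.
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])

for winner in leaderboard:
    print(winner[1], ' ~ ', winner[0])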

This must be improved to associate each list that matches a concept with its concept identifier, and to find a way to spell check everything (according to the knowledge base).
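One possible way to sketch both improvements is shown below. It is not the only approach: the alias-to-identifier dict, the `resolve` and `link` helpers, and the use of `difflib.get_close_matches` from the standard library as a stand-in for spell checking are all assumptions made for this example, and the 0.85 cutoff is a guess that would need tuning on real data.

```python
import difflib

# Hypothetical knowledge base: alias -> concept identifier.
KNOWLEDGEBASE = {
    "new york": "C1",
    "nyc": "C1",
    "big apple": "C1",
}


def spans(lst):
    """Yield every split of lst into contiguous, non-empty groups."""
    if not lst:
        return
    for index in range(1, len(lst)):
        for rest in spans(lst[index:]):
            yield [lst[:index]] + rest
    yield [lst]


def resolve(words, cutoff=0.85):
    """Return the concept identifier for a group of words, or None.

    Tries an exact alias match first, then the closest alias by
    difflib similarity (a crude substitute for spell checking).
    """
    text = " ".join(words).lower()
    if text in KNOWLEDGEBASE:
        return KNOWLEDGEBASE[text]
    close = difflib.get_close_matches(text, list(KNOWLEDGEBASE), n=1, cutoff=cutoff)
    return KNOWLEDGEBASE[close[0]] if close else None


def link(sentence):
    """Return the best split as (words, concept identifier or None) pairs."""
    best, best_score = [], -1
    for segmentation in spans(sentence.split()):
        linked = [(group, resolve(group)) for group in segmentation]
        score = sum(1 for _, concept in linked if concept is not None)
        if score > best_score:  # ties keep the first split found
            best, best_score = linked, score
    return best


# "yrok" is a deliberate typo to exercise the fuzzy match.
for group, concept in link("new yrok is the big apple"):
    print(concept, group)
```

Fuzzy matching can also produce false positives (a lower cutoff would let "the big apple" match "big apple"), so the threshold trades recall against precision.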

Thanks. This is useful, but it doesn't quite handle cases like abbreviations, which appear frequently in articles. Also, could you elaborate on the part "setup a index that maps a single word concept to an identifier."? — Alex Ramses, Oct 08 '19 at 09:39
@AlexRamses give me an example of an abbreviation that is not handled by the above algorithm. Mind the fact that a) the algorithm works in cooperation with the `knowledgebase` and b) this is a simple version of the algorithm. — amirouche, Oct 21 '19 at 08:39
For example: `input = 'ny is the big apple'.split()` as **ny** is a common abbreviation for New York. — Alex Ramses, Oct 21 '19 at 09:05
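One possible reading of the reply above is that abbreviations are handled on the knowledge base side, by storing them as extra aliases of the same concept. A minimal sketch under that assumption follows; the alias table and the normalisation step (lowercasing and dropping periods) are hypothetical and do not come from the answer.

```python
# Hypothetical alias table; "ny" is stored as one more alias of concept C1.
ALIASES = {
    "new york": "C1",
    "nyc": "C1",
    "ny": "C1",
    "big apple": "C1",
}


def normalize(text):
    """Lowercase and drop periods so that 'N.Y.' and 'ny' compare equal."""
    return text.lower().replace(".", "")


def lookup(text):
    """Return the concept identifier for an alias, or None if unknown."""
    return ALIASES.get(normalize(text))


print(lookup("ny"))    # abbreviation stored as an alias
print(lookup("N.Y."))  # same abbreviation with periods
print(lookup("la"))    # unknown abbreviation
```

Abbreviations that are missing from the alias table would still go unlinked, so this only shifts the problem to collecting aliases.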
