3

I want to use spaCy for Entity Linking (EL). I have already trained a spaCy Named Entity Recognition (NER) model with custom labels on my domain-specific corpus. However, the example below uses the standard entity labels PERSON and LOCATION.

Once aliases are set in the Knowledge Base (KB), the KB returns candidates for occurrences of recognized entities. For example, candidates for "Paris" can be the Wikidata entries Q47899 (Paris Hilton), Q7137357 (Paris Themmen), Q5214166 (Dan Paris), Q90 (Paris, capital of France), or Q830149 (Paris, county seat of Lamar County, Texas, United States).

My question concerns the recognized entity label. If the NER recognizes "Paris" as PERSON, this excludes Q90 (Paris, capital of France) and Q830149 (Paris, county seat of Lamar County, Texas, United States) from the candidates, leaving 3 candidates. If "Paris" were instead recognized as LOCATION, only the other 2 candidates would remain.

Is it possible to tell the KB or EL model which set of entities to choose the candidates from, given the detected NER label? And would this be done before or after training the EL model?

LBoss

2 Answers

2

This is currently not implemented in spaCy. Generally speaking, these would be the steps needed to get to the functionality you want:

  • Create a mapping between your KB entities (Wikidata identifiers) and your NER labels. This is not trivial. You need to either parse the Wikidata "instance of" meta information, or use the Wikipedia classification system, which has its pitfalls. Either way, you need to end up with an automated way of defining that Q830149 is-a "LOCATION" etc.
  • Store the "NER labels" for each entity. This could be done in the KB, but then the Cython structures need to be edited.
  • Reimplement the candidate generation (currently part of the KB: get_candidates method) to take a textual mention + its NER label, and only output relevant candidates for that specific label.
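
To make the three steps above concrete, here is a minimal standalone sketch in plain Python. It does not use spaCy's KB or its Cython structures. The names `Candidate`, `ENTITY_LABEL`, `ALIASES` and this `get_candidates` are hypothetical, the prior probabilities are made up, and the "instance of" value per entity is a simplified assumption rather than data pulled from Wikidata.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    entity_id: str     # Wikidata identifier
    prior_prob: float  # made-up prior, for illustration only


# Step 1: map Wikidata "instance of" classes to NER labels.
# Q5 (human) and Q515 (city) are Wikidata classes; a real mapping needs many more.
INSTANCE_OF_TO_LABEL = {"Q5": "PERSON", "Q515": "LOCATION"}

# Assumed "instance of" class per entity (simplified, not pulled from Wikidata).
ENTITY_INSTANCE_OF = {
    "Q47899": "Q5",
    "Q7137357": "Q5",
    "Q5214166": "Q5",
    "Q90": "Q515",
    "Q830149": "Q515",
}

# Step 2: store one NER label per entity.
ENTITY_LABEL = {
    entity_id: INSTANCE_OF_TO_LABEL[cls]
    for entity_id, cls in ENTITY_INSTANCE_OF.items()
}

# Alias table: every entity a mention could refer to.
ALIASES = {
    "Paris": [
        Candidate("Q90", 0.60),
        Candidate("Q47899", 0.20),
        Candidate("Q830149", 0.10),
        Candidate("Q7137357", 0.05),
        Candidate("Q5214166", 0.05),
    ]
}


# Step 3: candidate generation that takes the mention text and its NER label.
def get_candidates(mention: str, ner_label: str) -> List[Candidate]:
    return [
        cand
        for cand in ALIASES.get(mention, [])
        if ENTITY_LABEL.get(cand.entity_id) == ner_label
    ]


print([c.entity_id for c in get_candidates("Paris", "PERSON")])
print([c.entity_id for c in get_candidates("Paris", "LOCATION")])
```

With the sample data, a PERSON mention of "Paris" keeps the three people and a LOCATION mention keeps the two places. A mention whose label has no matching entity gets an empty candidate list.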

One caveat I'd like to point out is that this approach may amplify errors from the NER step. Imagine that you're talking about Paris, the capital, but your NER gets it wrong and tags it as a "PERSON". With the approach described here, the NEL won't be able to recover from that, and will output the most likely person it can find, even though none of them is correct.

Another approach would be to leave the candidate generator unchanged, but take the NER label into account as part of the scoring mechanism in the entity_linker pipe. Currently, it already combines two scores: one from the prior probability (using stats from a large training corpus), and one from the context (using ML and sentence similarity). Whether the NER label matches could be included in that score, and then there would still be a chance of linking "Paris" to the correct entity, even when its NER label is wrong. How well that works depends on how strictly you want to enforce the label match.
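
As a rough illustration of this softer variant, the sketch below folds a label-match term into a combined score. Everything in it is an assumption made for the example: the function names, the way prior and context are combined (a probabilistic OR), the `mismatch_penalty` parameter, and all the numbers. It is not the entity_linker's actual implementation.

```python
def combine(prior: float, context: float) -> float:
    # Probabilistic OR of the two scores (an assumed combination rule).
    return prior + context - prior * context


def score(prior: float, context: float, label_matches: bool,
          mismatch_penalty: float = 0.5) -> float:
    # mismatch_penalty = 1.0 ignores the NER label entirely,
    # mismatch_penalty = 0.0 behaves like a hard filter.
    base = combine(prior, context)
    return base if label_matches else base * mismatch_penalty


def best_entity(candidates, ner_label, mismatch_penalty=0.5):
    # candidates: list of (entity_id, entity_label, prior, context) tuples.
    return max(
        candidates,
        key=lambda c: score(c[2], c[3], c[1] == ner_label, mismatch_penalty),
    )[0]


# The text is about the French capital, but the NER wrongly says PERSON.
candidates = [
    ("Q90", "LOCATION", 0.6, 0.9),    # strong prior and strong context
    ("Q47899", "PERSON", 0.2, 0.1),   # label matches, context does not
]

print(best_entity(candidates, "PERSON", mismatch_penalty=0.5))
print(best_entity(candidates, "PERSON", mismatch_penalty=0.0))
```

With a penalty of 0.5 the capital still wins despite the wrong NER label, while a penalty of 0.0 reproduces the hard filter and forces a person to be chosen. The penalty is the knob for how strictly the label is enforced.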

Sofie VL
  • Thank you @SofieVL for the detailed answer. – LBoss Oct 13 '20 at 10:14
  • @LBoss take a look at this one: https://github.com/AdirthaBorgohain/NER-RE, maybe it can help. I'm trying to figure out how to create a candidate function as well. – rdemorais Jul 23 '22 at 14:15
0

I just had an idea myself. I guess it would be possible to have 2 pipes and train a separate NER model for each entity type, then have a separate KB and EL model in each pipe, and finally combine the results of the pipes.
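
A small sketch of the routing part of this idea, in plain Python with plain dicts standing in for the per-label KBs. The helper names `split_kb_by_label` and `candidates_for` are hypothetical, and the entity-to-label mapping is assumed to exist already. Each per-label KB would then get its own EL model.

```python
from collections import defaultdict


def split_kb_by_label(aliases, entity_labels):
    # aliases: alias -> list of entity ids; entity_labels: entity id -> NER label.
    per_label = defaultdict(lambda: defaultdict(list))
    for alias, entity_ids in aliases.items():
        for entity_id in entity_ids:
            per_label[entity_labels[entity_id]][alias].append(entity_id)
    # Convert to plain dicts so lookups never create empty entries.
    return {label: dict(kb) for label, kb in per_label.items()}


def candidates_for(mention, ner_label, per_label_kbs):
    # Route the mention to the KB that belongs to its NER label.
    return per_label_kbs.get(ner_label, {}).get(mention, [])


aliases = {"Paris": ["Q47899", "Q7137357", "Q5214166", "Q90", "Q830149"]}
entity_labels = {
    "Q47899": "PERSON",
    "Q7137357": "PERSON",
    "Q5214166": "PERSON",
    "Q90": "LOCATION",
    "Q830149": "LOCATION",
}

kbs = split_kb_by_label(aliases, entity_labels)
print(candidates_for("Paris", "LOCATION", kbs))
```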

LBoss
  • Good idea! You could train one NER model, but separate NEL models for each NER label. You'll still have to find a way to map the Wikidata entities to your NER labels though, to separate out the different KBs. – Sofie VL Oct 13 '20 at 10:20