I am trying to follow the example here: https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking, but I am confused about what is in the training data. Is it everything from Wikipedia? Say I just need training data on a few entities, for example E1, E2, and E3. Does the example allow me to specify only the few entities that I want to disambiguate?
1 Answer
[UPDATE] Note that this code base was moved to https://github.com/explosion/projects/tree/master/nel-wikipedia (spaCy v2)
If you run the scripts as provided in https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking, they will indeed create a training dataset from Wikipedia that you can use to train a generic model on.
If you're looking to train a more limited model, you can of course feed in your own training set. A toy example can be found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_entity_linker.py, where you can deduce the format of the training data:
def sample_train_data():
    train_data = []

    # Q2146908 (Russ Cochran): American golfer
    # Q7381115 (Russ Cochran): publisher

    text_1 = "Russ Cochran his reprints include EC Comics."
    dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_1, {"links": dict_1}))

    text_2 = "Russ Cochran has been publishing comic art."
    dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_2, {"links": dict_2}))

    text_3 = "Russ Cochran captured his first major title with his son as caddie."
    dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_3, {"links": dict_3}))

    text_4 = "Russ Cochran was a member of University of Kentucky's golf team."
    dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_4, {"links": dict_4}))

    return train_data
This example in `train_entity_linker.py` shows you how the model learns to disambiguate "Russ Cochran" the golfer (Q2146908) from the publisher (Q7381115). Note that it is just a toy example: a realistic application would require a larger knowledge base with accurate prior frequencies (which you can get by running the Wikipedia/Wikidata scripts), and of course you would need many more sentences and more lexical variety for the machine learning model to pick up proper clues and generalize efficiently to unseen text.
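To give an idea of how such a training set is consumed, below is a minimal sketch (spaCy v2 API, not taken verbatim from the example script) that builds a toy knowledge base for the two IDs above and trains the `entity_linker` pipe on the sample data. It assumes the `en_core_web_md` model is installed for its word vectors, and the entity vectors, frequencies and prior probabilities are made-up placeholder values:

import random

import spacy
from spacy.kb import KnowledgeBase
from spacy.util import minibatch, compounding

# Load a model that ships with word vectors; the v2 entity linker relies on them.
nlp = spacy.load("en_core_web_md")

# Build a tiny knowledge base for the two candidate entities.
# The entity vectors and frequencies are placeholders, not real values.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])   # golfer
kb.add_entity(entity="Q7381115", freq=12, entity_vector=[-2, 4, 2])   # publisher
kb.add_alias(alias="Russ Cochran",
             entities=["Q2146908", "Q7381115"],
             probabilities=[0.5, 0.5])  # prior probabilities per candidate

# Create the entity linker pipe, attach the KB and add it to the pipeline.
entity_linker = nlp.create_pipe("entity_linker", config={"incl_prior": False})
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)

# Pre-process the texts so that doc.ents is set before training; this assumes
# the model's NER marks "Russ Cochran" (characters 0-12) as an entity span.
train_docs = []
for text, annotation in sample_train_data():
    with nlp.disable_pipes("entity_linker"):
        doc = nlp(text)
    train_docs.append((doc, annotation))

# Train only the entity linker, leaving the other components untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(20):
        random.shuffle(train_docs)
        losses = {}
        for batch in minibatch(train_docs, size=compounding(4.0, 32.0, 1.001)):
            docs, annotations = zip(*batch)
            nlp.update(docs, annotations, drop=0.2, sgd=optimizer, losses=losses)
        print(itn, "Losses", losses)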

- Thanks for the response. So if I were to use the Wikipedia script, what exactly is the format of the training data? In other words, does it contain training examples for ALL entities that can be found on Wikipedia? If so, I should just be able to filter for the ones I need, no? – formicaman Feb 06 '20 at 15:47
- Additionally, in my case, the entities that I would need to train it on could change every once in a while -- that's why I would like to use the Wikipedia data, so I could just look up the entities I need from the dump. – formicaman Feb 06 '20 at 15:47
- Sure, you can do custom filtering on the generated training data from Wikipedia. This training file is `gold_entities.jsonl` and contains one document per line, plus all entity annotations (offset + database ID) in that document. – Sofie VL Feb 06 '20 at 21:09
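As a rough sketch of that filtering step (note that the field names "entities" and "entity" below are assumptions; inspect one line of your own `gold_entities.jsonl` to confirm the exact schema first):

import json

# Entity IDs to keep (e.g. your E1, E2, E3); all other annotations are dropped.
WANTED_IDS = {"Q2146908", "Q7381115"}

with open("gold_entities.jsonl", encoding="utf8") as infile, \
        open("gold_entities_filtered.jsonl", "w", encoding="utf8") as outfile:
    for line in infile:
        doc = json.loads(line)
        # Keep only the annotations that point at one of the wanted entity IDs.
        kept = [ann for ann in doc.get("entities", []) if ann.get("entity") in WANTED_IDS]
        if kept:
            doc["entities"] = kept
            outfile.write(json.dumps(doc) + "\n")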
- Is the path to the WP data supposed to be the compressed file or the uncompressed file? – formicaman Feb 10 '20 at 20:17
- Just keep the compressed `.bz2` files - you don't want to uncompress this :-) – Sofie VL Feb 11 '20 at 10:56
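For context, the compressed dump can be streamed line by line with Python's built-in `bz2` module (a generic sketch, independent of the spaCy scripts; the dump filename is just an example):

import bz2

# Read the Wikipedia XML dump in its compressed form, one line at a time,
# without ever writing the uncompressed XML to disk.
with bz2.open("enwiki-latest-pages-articles.xml.bz2", mode="rt", encoding="utf8") as dump:
    for line in dump:
        pass  # process each XML line here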
- Appreciate it! Do you have any insight into how I could filter the XML or speed up the process, considering I am only looking for a few entities? I believe the script says it would take 10 hours. – formicaman Feb 11 '20 at 13:56
- Or I am thinking about just using the `limit_prior` and `limit_train` parameters, but wanted to check that there were 1B lines. – formicaman Feb 11 '20 at 14:51
- And sorry for one last thing: what's the 'entity definition'? I understand what the description is, but not the definition. – formicaman Feb 11 '20 at 16:09