5

I'm trying to train a spaCy entity linking model on Wikidata and Wikipedia, using the scripts in https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking. I've generated the KB and moved on to training the model, but training still has not finished after more than a week. How long should it normally take? (I'm not using a GPU.)

Alternatively, is there a pretrained Wikidata entity linking model I can use?

Thanks

Alessandro

2 Answers

4

As of October 2019, spaCy does not yet provide a pre-trained entity linking model. It only offers the framework and the functionality to train one yourself.

I recommend commenting on this GitHub thread with your request and your question about training times.

https://github.com/explosion/spaCy/issues/4511

David Bernat
  • Thanks. I did post a comment on the GitHub thread, but there is no solution yet. I believe the issue is caused by the enormous amount of memory needed to train the model – it crashes even with more than 300 GB of RAM. – Alessandro Dec 03 '19 at 15:24
  • You're welcome. Several people seem to have made comments referring to the memory issue. That doesn't make sense at all and simply suggests a design error in their data loader. The model itself has no reason to be that large. Hopefully they will fix the model and publish a model soon enough! :-) – David Bernat Dec 09 '19 at 00:51
  • The memory issue has been fixed since Jan 6: https://github.com/explosion/spaCy/pull/4811 – Sofie VL Mar 15 '20 at 17:42
1

This PR to spaCy has modifications that allow for training on a larger dataset. Instructions are also updated.

Union find
  • That PR is not about "allowing for training on a smaller dataset"; that was always a feature. However, that PR is in fact quite relevant, as it significantly reduces the memory needed to train the NEL component on a LARGER dataset. – Sofie VL Mar 15 '20 at 17:46
  • @SofieVL Ack, typo there.. Fixed. – Union find Mar 15 '20 at 18:36
  • Ah, yes, that does make more sense, thanks for updating :-) – Sofie VL Mar 15 '20 at 20:24
  • The question in that thread -- how many lines are needed -- do you have an estimate? – Union find Mar 15 '20 at 20:35
  • I replied to the thread as well - the experiments I did with 165K lines (i.e. articles) seem to result in a decent model. – Sofie VL Mar 16 '20 at 07:23
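
For anyone who wants to experiment with a capped training set like the 165K lines mentioned above, here is a minimal sketch of reading only the first N lines of a line-delimited file lazily, so the whole file never sits in memory. This is not the spaCy script's own loader: the function name `iter_training_lines`, the one-JSON-object-per-line layout, and the file name in the usage note are all assumptions made for illustration.

```python
import itertools
import json
from typing import Iterator


def iter_training_lines(path: str, limit: int = 165_000) -> Iterator[dict]:
    """Yield at most `limit` parsed records from a JSON-lines file.

    Hypothetical helper: assumes one JSON object per line. The file is
    read lazily, so memory use does not grow with the file size.
    """
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines without reading the rest.
        for line in itertools.islice(handle, limit):
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Usage would look something like `for record in iter_training_lines("training_data.jsonl"): ...`, where the file name is hypothetical. Because the function is a generator, records are handed to the training loop one at a time rather than collected into a list first.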