SpaCy 3: how to get the raw data used to train en_core_web_sm?

Question

I am new to SpaCy. I noticed that there are a number of NER categories listed in the documentation of all en_core_web models:

'CARDINAL', 
'DATE', 
'EVENT', 
'FAC', 
'GPE', 
'LANGUAGE', 
'LAW', 
'LOC', 
'MONEY', 
'NORP', 
'ORDINAL', 
'ORG', 
'PERCENT', 
'PERSON', 
'PRODUCT', 
'QUANTITY', 
'TIME', 
'WORK_OF_ART'

I need to access the raw data used to assign each word the correct category. In other words, what's the list of words labelled as 'WORK_OF_ART', and is this list available?

The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.

To be perfectly clear, since it is not clear if you are aware of this from your question: the training data is labelled by hand, not using word lists, so you cannot reproduce models using just word lists. (For example, "Sears" can be a person or a company depending on context.) — polm23, Sep 09 '21 at 05:40
@polm23 this surely helps, thanks. I haven't managed to find the manually labelled data, though, from which I could extract the word lists I need. — Zizzipupp, Sep 09 '21 at 09:40
For English the models are trained on OntoNotes 5, which is available from the Linguistic Data Consortium but is expensive. The lists of labelled words are not saved in the models in any form. Even the training data consists of sentences with marked words; there are no "word lists". — polm23, Sep 09 '21 at 12:35

score 1 · Accepted Answer · answered Sep 08 '21 at 04:16

The training data depends on which variant of en_core_web you use:

Dataset	License	URL	web_sm	web_md	web_lg	web_trf
OntoNotes 5	LDC Non-Members	https://catalog.ldc.upenn.edu/LDC2013T19	✓	✓	✓	✓
Wordnet 3.0	WordNet License	https://wordnet.princeton.edu/download	✓	✓	✓	✓
ClearNLP Constituent-to-Dependency Conversion	Apache 2.0	dependency_conversion.md	✓	✓	✓	✓
GloVe Common Crawl	Apache 2.0	https://nlp.stanford.edu/projects/glove/	✕	✓	✓	✕
Roberta Base	???	Fairseq Roberta

The NER labelling scheme described at https://spacy.io/models/en comes from OntoNotes, which includes NER annotations; see Section 2.6 of https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

The NER tags adopt the CoNLL BIO format (see https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO). When read properly, each sentence should be a list of tuples, as in "Get Stanford NER result through NLTK with IOB format".
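To make the "list of tuples" shape concrete, here is a minimal sketch of a reader for BIO-tagged text. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences; the actual column layout of any given OntoNotes conversion may differ, so check the files first. The function name `read_bio` and the sample sentences are made up for illustration and are not taken from OntoNotes.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_bio(text: str) -> List[Sentence]:
    """Parse CoNLL-style BIO text into one list of (token, tag) tuples per sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: the token is the first column and the NER tag is the last.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Invented sample data, for illustration only.
SAMPLE = """\
I O
read O
Moby B-WORK_OF_ART
Dick I-WORK_OF_ART
in O
Boston B-GPE
. O

Sears B-ORG
closed O
. O
"""

sentences = read_bio(SAMPLE)
print(sentences[0])
```

Each sentence keeps its tokens in order, so the context that decides a label is preserved alongside the tag.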

Also take a look at https://github.com/flairNLP/flair/, which might help when it comes to training NER on OntoNotes.
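On the "list of WORK_OF_ART words" part of the question: if you do have BIO-tagged sentences, you can collect the entity strings seen under each label, roughly as sketched below. This only yields the mentions that happen to occur in the annotated text, not a dictionary the model uses, and the same string can appear under several labels. The helper name `entity_spans` and the example sentences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def entity_spans(sentence: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (entity_text, label) pairs from one BIO-tagged sentence."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        prefix, _, tag_label = tag.partition("-")
        # An I- tag with the same label continues the open entity.
        if prefix == "I" and tokens and label == tag_label:
            tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
            tokens, label = [], None
        # B- starts a new entity; a stray I- is treated as a start too.
        if prefix in ("B", "I"):
            tokens, label = [token], tag_label
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# Invented example sentences, for illustration only.
tagged = [
    [("I", "O"), ("read", "O"), ("Moby", "B-WORK_OF_ART"),
     ("Dick", "I-WORK_OF_ART"), ("in", "O"), ("Boston", "B-GPE"), (".", "O")],
    [("Sears", "B-ORG"), ("closed", "O"), (".", "O")],
    [("Mr.", "O"), ("Sears", "B-PERSON"), ("left", "O"), (".", "O")],
]

by_label: Dict[str, Set[str]] = defaultdict(set)
for sentence in tagged:
    for text, label in entity_spans(sentence):
        by_label[label].add(text)

print(sorted(by_label["WORK_OF_ART"]))
```

In this toy data "Sears" ends up under both ORG and PERSON, which is why a word list alone cannot reproduce the model's behaviour.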
