I am new to spaCy. I noticed that the documentation for all en_core_web models lists a number of NER categories:

'CARDINAL', 
'DATE', 
'EVENT', 
'FAC', 
'GPE', 
'LANGUAGE', 
'LAW', 
'LOC', 
'MONEY', 
'NORP', 
'ORDINAL', 
'ORG', 
'PERCENT', 
'PERSON', 
'PRODUCT', 
'QUANTITY', 
'TIME', 
'WORK_OF_ART'

I need to access the raw data used to assign each word the correct category. In other words, what's the list of words labelled as 'WORK_OF_ART', and is this list available?

The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.

alvas
Zizzipupp
  • To be perfectly clear, since it is not clear if you are aware of this from your question: the training data is labelled by hand, not using word lists, so you cannot reproduce models using just word lists. (For example, "Sears" can be a person or a company depending on context.) – polm23 Sep 09 '21 at 05:40
  • @polm23 this surely helps, thanks. I haven't managed to find the manually labelled data, though, from which I could extract the word lists I need. – Zizzipupp Sep 09 '21 at 09:40
  • For English, the models are trained on OntoNotes 5, which is available from the Linguistic Data Consortium but is expensive. The lists of labelled words are not saved in the models in any form. Even the training data consists of sentences with marked words; there are no "word lists". – polm23 Sep 09 '21 at 12:35

1 Answer

The training data varies depending on the en_core_web variant (web_sm, web_md, web_lg, web_trf). The sources listed for these models are:

| Dataset | License | URL |
|---|---|---|
| OntoNotes 5 | LDC Non-Members | https://catalog.ldc.upenn.edu/LDC2013T19 |
| WordNet 3.0 | WordNet License | https://wordnet.princeton.edu/download |
| ClearNLP Constituent-to-Dependency Conversion | Apache 2.0 | dependency_conversion.md |
| GloVe Common Crawl | Apache 2.0 | https://nlp.stanford.edu/projects/glove/ |
| Roberta Base | ??? | Fairseq Roberta |

The NER labelling scheme described at https://spacy.io/models/en comes from OntoNotes, which contains the NER tags; see Section 2.6 of https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

The NER tags adopt the CoNLL BIO format (see https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO). When read properly, each sentence should be a list of tuples, as in the question "Get Stanford NER result through NLTK with IOB format".
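To make "a list of tuples per sentence" concrete, here is a minimal sketch that parses CoNLL-style BIO text (one token and tag per line, a blank line between sentences) and collects the entity spans for one label. The sample text, the two-column layout, and the helper names `read_bio` and `entities_with_label` are assumptions for illustration only; a real OntoNotes conversion may carry more columns.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]

# Hypothetical sample in a two-column BIO layout: token, then tag.
SAMPLE = """\
I O
read O
War B-WORK_OF_ART
and I-WORK_OF_ART
Peace I-WORK_OF_ART
in O
London B-GPE
. O

Sears B-ORG
opened O
. O
"""


def read_bio(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) tuples into sentences, split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: token is the first column, BIO tag is the last.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def entities_with_label(sentences: List[Sentence], label: str) -> List[str]:
    """Join B-/I- tagged tokens of the given label into entity strings."""
    found: List[str] = []
    for sent in sentences:
        span: List[str] = []
        for token, tag in sent:
            if tag == f"B-{label}":
                if span:
                    found.append(" ".join(span))
                span = [token]
            elif tag == f"I-{label}" and span:
                span.append(token)
            else:
                if span:
                    found.append(" ".join(span))
                span = []
        if span:
            found.append(" ".join(span))
    return found


sentences = read_bio(SAMPLE.splitlines())
works = entities_with_label(sentences, "WORK_OF_ART")
print(works)
```

Any "word list" obtained this way is derived from annotated sentences, so the same surface form can show up under different labels depending on context.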

Also take a look at https://github.com/flairNLP/flair/ when it comes to training NER on OntoNotes; it might help.

alvas