I am trying to train spaCy models using just the python -m spacy train command line tool without writing any code of my own.

I have a training set of documents to which I have added OIL_COMPANY entity spans. I used gold.docs_to_json to create training files in the JSON-serializable format.
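As a sanity check on the training files, the entity labels they contain can be listed with a few lines of standard-library Python. This is only a sketch: the function name collect_ner_labels is made up for illustration, and it assumes the nesting of spaCy v2's JSON training format (documents, then paragraphs, sentences and tokens, each token carrying a BILUO "ner" tag). The sample data is invented.

```python
import json

def collect_ner_labels(training_docs):
    """Return the sorted entity labels found in the tokens' BILUO 'ner' tags."""
    labels = set()
    for doc in training_docs:
        for paragraph in doc.get("paragraphs", []):
            for sentence in paragraph.get("sentences", []):
                for token in sentence.get("tokens", []):
                    tag = token.get("ner", "O")
                    # Tags look like "B-OIL_COMPANY"; "O" and "-" carry no label.
                    if "-" in tag and tag != "-":
                        labels.add(tag.split("-", 1)[1])
    return sorted(labels)

# Invented sample in the assumed layout; a real file would be read with json.load().
sample = json.loads("""
[{"id": 0, "paragraphs": [{"raw": "Acme Oil drills.", "sentences": [{"tokens": [
    {"id": 0, "orth": "Acme", "ner": "B-OIL_COMPANY"},
    {"id": 1, "orth": "Oil", "ner": "L-OIL_COMPANY"},
    {"id": 2, "orth": "drills", "ner": "O"},
    {"id": 3, "orth": ".", "ner": "O"}
], "brackets": []}]}]}]
""")

print(collect_ner_labels(sample))
```

Any label reported here that the base model's NER component does not already know is a candidate for the E022 error shown below.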

I can train starting from an empty model. However, if I try to extend the existing en_core_web_lg model, I see the following error:

KeyError: "[E022] Could not find a transition with the name 'B-OIL_COMPANY' in the NER model."

So I need to be able to tell the command line tool to add OIL_COMPANY to an existing list of NER labels. The discussion in "Training an additional entity type" shows how to do this in code by calling add_label on the NER pipeline, but I don't see any command line option that does this.

Is it possible to extend an existing NER model to new entities with just the command line training tools, or do I have to write code?

W.P. McNeill

2 Answers

Ines answered this for me on the Prodigy support forum.

I think what's happening here is that the spacy train command expects the base model you want to update to already have all labels added that you want to train. (It processes the data as a stream, so it's not going to compile all labels upfront and silently add them on the fly.) So if you want to update an existing pretrained model and add a new label, you should be able to just add the label and save out the base model:

ner = nlp.get_pipe("ner")
ner.add_label("YOUR_LABEL")
nlp.to_disk("./base-model")

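This isn't quite writing no code, but it's pretty close: once the label has been added, the saved ./base-model directory can be passed to spacy train through its --base-model option.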
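For completeness, the quoted snippet can be wrapped into a small reusable helper. This is a sketch rather than part of Ines's answer: the name add_ner_labels is hypothetical, and it assumes only that the pipeline object provides get_pipe and to_disk and that the NER component provides labels and add_label, as in spaCy v2.

```python
def add_ner_labels(nlp, labels, output_dir):
    """Add any labels the NER component lacks, save the model, return what was added."""
    ner = nlp.get_pipe("ner")
    existing = set(ner.labels)
    # Only add labels the component does not know yet.
    added = [label for label in labels if label not in existing]
    for label in added:
        ner.add_label(label)
    # Write the updated pipeline to disk so it can serve as the base model.
    nlp.to_disk(output_dir)
    return added

# Usage with a real model (assumes spaCy and en_core_web_lg are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   add_ner_labels(nlp, ["OIL_COMPANY"], "./base-model")
```

Skipping labels that already exist keeps the helper safe to rerun on a base model that has been extended before.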

W.P. McNeill

See the spaCy command-line interface documentation, which describes the train command as follows.

Train a model. Expects data in spaCy’s JSON format. On each epoch, a model will be saved out to the directory. Accuracy scores and model details will be added to a meta.json to allow packaging the model using the package command.

python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec]
[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level]
[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel]
[--textcat-positive-label] [--verbose]
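To make the synopsis concrete for the question above, here is a minimal sketch that assembles one such invocation in Python. The helper name build_train_command and the paths (./output, train.json, dev.json, ./base-model) are placeholders chosen for illustration; only the positional arguments and the --base-model and --pipeline options are taken from the usage text.

```python
import sys

def build_train_command(lang, output_path, train_path, dev_path,
                        base_model=None, pipeline=None):
    """Build the argument list for `python -m spacy train` (v2-style CLI)."""
    cmd = [sys.executable, "-m", "spacy", "train",
           lang, output_path, train_path, dev_path]
    if base_model is not None:
        cmd += ["--base-model", base_model]
    if pipeline is not None:
        cmd += ["--pipeline", pipeline]
    return cmd

# Placeholder paths; substitute the real training and development files.
cmd = build_train_command("en", "./output", "train.json", "dev.json",
                          base_model="./base-model", pipeline="ner")
print(" ".join(cmd))
# To actually start training: import subprocess; subprocess.run(cmd, check=True)
```

The base model passed here still has to contain every label used in the training data, otherwise training stops with the E022 error from the question.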
APhillips