There are three main components of a "language model" in spaCy:

- the "static" language-specific data shipped in Python (tokenizer exceptions, stop words, rules for mapping fine-grained to coarse-grained part-of-speech tags),
- the statistical model trained to predict part-of-speech tags, dependencies and named entities (trained on a large labelled corpus and included as binary weights), and
- optional word vectors that can be converted and added before or after training.

You can also train your own vectors on your raw text using a library like Gensim and then add them to spaCy.
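For example, here's a rough sketch of that workflow – it assumes Gensim 3.x (where the keyword is `size`; Gensim 4 renamed it to `vector_size`) and a whitespace-tokenized `corpus.txt`:

```python
from gensim.models import Word2Vec
import spacy

# Train vectors on the raw text – adjust to your own tokenization.
sentences = [line.split() for line in open("corpus.txt", encoding="utf8")]
w2v = Word2Vec(sentences, size=300, window=5, min_count=5)

# Add the vectors to a blank spaCy pipeline and save it out.
nlp = spacy.blank("xx")  # "xx" = multi-language; use your language code if spaCy has it
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv[word])
nlp.to_disk("model_with_vectors")
```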
spaCy v2.x allows you to train all pipeline components independently or in one go, so you can train the tagger, parser and entity recognizer on your data. All of this requires labelled data. If you're training a new language from scratch, you normally use an existing treebank. Here's an example using the Universal Dependencies AnCora corpus for Spanish (which is also the one that was used to train spaCy's Spanish model). You can then convert the data to spaCy's JSON format and use the `spacy train` command to train a model. For example:
```bash
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
```
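Once training finishes, you can load the result like any other model. In v2.x, `spacy train` writes out one directory per epoch plus a final model – a quick sanity check might look like this (check your `models` directory for the exact name; `model-final` is what I'd expect):

```python
import spacy

# Path assumes spaCy v2's default output naming – adjust if yours differs.
nlp = spacy.load("models/model-final")
doc = nlp("El presidente visitó Barcelona.")
print([(token.text, token.pos_, token.dep_) for token in doc])
```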
I don't know what's in your `corpus.txt` and whether it's fully labelled or only raw text. (I also don't know of any existing resources for Luxembourgish – sounds like that's potentially quite hard to find!) If your data is labelled, you can convert it to spaCy's format using one of the built-in converters or your own little script (there's a rough sketch of such a script after the list below). If your corpus consists of only raw text, you need to label it first and assess whether it's suitable for training a general language model. Ultimately, this comes down to experimenting – but here are some strategies:
- Label your entire corpus manually for each component – e.g. part-of-speech tags if you want to train the tagger, dependency labels if you want to train the parser, and entity spans if you want to train the entity recognizer. You'll need a lot of data though – ideally, a corpus of a similar size to the Universal Dependencies ones.
- Experiment with teaching an existing pre-trained model Luxembourgish – for example, the German model. This might sound strange, but it's not an uncommon strategy: instead of training from scratch, you post-train the existing model with examples of Luxembourgish, ideally until its predictions on your Luxembourgish text are good enough. You can also create more training data by running the German model over your Luxembourgish text and extracting and correcting its mistakes (see here for details). There's a minimal sketch of such an update loop below.
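To illustrate the "little script" route mentioned above: spaCy v2.x's JSON training format nests documents, paragraphs, sentences and tokens. Here's a rough, hypothetical sketch that assumes one sentence per line with `word/TAG` tokens and fills the parse and entity fields with dummy values – the exact schema is in the annotation specs docs:

```python
import json

def convert(infile, outfile):
    # Hypothetical input format: one sentence per line, tokens as "word/TAG".
    paragraphs = []
    for line in open(infile, encoding="utf8"):
        if not line.strip():
            continue
        tokens = []
        for i, item in enumerate(line.split()):
            word, tag = item.rsplit("/", 1)
            tokens.append({
                "id": i,
                "orth": word,
                "tag": tag,
                "head": 0,      # relative head offset; 0 = token is its own head
                "dep": "ROOT",  # dummy value – replace if you have parses
                "ner": "O",     # dummy value – replace if you have entities
            })
        paragraphs.append({"sentences": [{"tokens": tokens}]})
    with open(outfile, "w", encoding="utf8") as f:
        json.dump([{"id": 0, "paragraphs": paragraphs}], f, ensure_ascii=False, indent=2)

convert("corpus_tagged.txt", "corpus_tagged.json")  # hypothetical filenames
```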
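And to illustrate the post-training idea, here's a minimal sketch using spaCy v2's update API. The model name, the single example and its tags are all placeholders – in practice you'd want many corrected examples, with tag sequences that line up with the model's tokenization and tag set:

```python
import random
import spacy

# Hypothetical example – tags must match the German model's tokenization
# and its fine-grained (STTS) tag set.
TRAIN_DATA = [
    ("Den Himmel ass blo.", {"tags": ["ART", "NN", "VAFIN", "ADJD", "$."]}),
]

nlp = spacy.load("de_core_news_sm")  # assumes the German model is installed
# Only update the tagger here – disable the other pipes so they stay untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # keep the existing weights (v2.1+)
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.3)

nlp.to_disk("post_trained_model")
```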
Remember that you always need evaluation data, too (also referred to as "development data" in the docs). This is usually a random portion of your labelled data that you hold back during training and use to determine whether your model is improving.
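In practice, this can be as simple as shuffling and slicing your examples. The 80/20 split here is just an illustration:

```python
import random

labelled_examples = [...]  # your full list of labelled (text, annotations) pairs

random.seed(0)  # make the split reproducible
random.shuffle(labelled_examples)
cutoff = int(len(labelled_examples) * 0.8)
train_data = labelled_examples[:cutoff]  # used for training
dev_data = labelled_examples[cutoff:]    # held back to check progress
```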