There are three main components of a "language model" in spaCy:

- the "static" language-specific data shipped in Python (tokenizer exceptions, stop words, rules for mapping fine-grained to coarse-grained part-of-speech tags),
- the statistical model trained to predict part-of-speech tags, dependencies and named entities (trained on a large labelled corpus and included as binary weights), and
- optional word vectors that can be converted and added before or after training.

You can also train your own vectors on your raw text using a library like Gensim and then add them to spaCy.
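For example, here's a rough sketch of that workflow – it assumes Gensim 3.x (where the keyword is `size`; Gensim 4 renamed it to `vector_size`) and a whitespace-tokenized `corpus.txt`:

```python
from gensim.models import Word2Vec
import spacy

# Train vectors on the raw text – adjust to your own tokenization.
sentences = [line.split() for line in open("corpus.txt", encoding="utf8")]
w2v = Word2Vec(sentences, size=300, window=5, min_count=5)

# Add the vectors to a blank spaCy pipeline and save it out.
nlp = spacy.blank("xx")  # "xx" = multi-language; use your language code if spaCy has it
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv[word])
nlp.to_disk("model_with_vectors")
```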
spaCy v2.x allows you to train all pipeline components independently or in one go, so you can train the tagger, parser and entity recognizer on your data. All of this requires labelled data. If you're training a new language from scratch, you normally use an existing treebank. Here's an example using the Universal Dependencies AnCora corpus for Spanish (which is also the one that was used to train spaCy's Spanish model). You can then convert the data to spaCy's JSON format and use the `spacy train` command to train a model. For example:
```bash
git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora
mkdir ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-train.conllu ancora-json
python -m spacy convert UD_Spanish-AnCora/es_ancora-ud-dev.conllu ancora-json
mkdir models
python -m spacy train es models ancora-json/es_ancora-ud-train.json ancora-json/es_ancora-ud-dev.json
```
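Once training finishes, you can load the result like any other model. In v2.x, `spacy train` writes out one directory per epoch plus a final model – a quick sanity check might look like this (check your `models` directory for the exact name; `model-final` is what I'd expect):

```python
import spacy

# Path assumes spaCy v2's default output naming – adjust if yours differs.
nlp = spacy.load("models/model-final")
doc = nlp("El presidente visitó Barcelona.")
print([(token.text, token.pos_, token.dep_) for token in doc])
```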
I don't know what's in your `corpus.txt` and whether it's fully labelled or only raw text. (I also don't know of any existing resources for Luxembourgish – sounds like that's potentially quite hard to find!) If your data is labelled, you can convert it to spaCy's format using one of the built-in converters or your own little script (there's a rough sketch of such a script after the list below). If your corpus consists of only raw text, you need to label it first and assess whether it's suitable for training a general language model. Ultimately, this comes down to experimenting – but here are some strategies:
- Label your entire corpus manually for each component – e.g. part-of-speech tags if you want to train the tagger, dependency labels if you want to train the parser, and entity spans if you want to train the entity recognizer. You'll need a lot of data though – ideally, a corpus of a similar size to the Universal Dependencies ones.
- Experiment with teaching an existing pre-trained model Luxembourgish – for example, the German model. This might sound strange, but it's not an uncommon strategy: instead of training from scratch, you post-train the existing model with examples of Luxembourgish, ideally until its predictions on your Luxembourgish text are good enough. You can also create more training data by running the German model over your Luxembourgish text and extracting and correcting its mistakes (see here for details). There's a minimal sketch of such an update loop below.
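To illustrate the "little script" route mentioned above: spaCy v2.x's JSON training format nests documents, paragraphs, sentences and tokens. Here's a rough, hypothetical sketch that assumes one sentence per line with `word/TAG` tokens and fills the parse and entity fields with dummy values – the exact schema is in the annotation specs docs:

```python
import json

def convert(infile, outfile):
    # Hypothetical input format: one sentence per line, tokens as "word/TAG".
    paragraphs = []
    for line in open(infile, encoding="utf8"):
        if not line.strip():
            continue
        tokens = []
        for i, item in enumerate(line.split()):
            word, tag = item.rsplit("/", 1)
            tokens.append({
                "id": i,
                "orth": word,
                "tag": tag,
                "head": 0,      # relative head offset; 0 = token is its own head
                "dep": "ROOT",  # dummy value – replace if you have parses
                "ner": "O",     # dummy value – replace if you have entities
            })
        paragraphs.append({"sentences": [{"tokens": tokens}]})
    with open(outfile, "w", encoding="utf8") as f:
        json.dump([{"id": 0, "paragraphs": paragraphs}], f, ensure_ascii=False, indent=2)

convert("corpus_tagged.txt", "corpus_tagged.json")  # hypothetical filenames
```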
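And to illustrate the post-training idea, here's a minimal sketch using spaCy v2's update API. The model name, the single example and its tags are all placeholders – in practice you'd want many corrected examples, with tag sequences that line up with the model's tokenization and tag set:

```python
import random
import spacy

# Hypothetical example – tags must match the German model's tokenization
# and its fine-grained (STTS) tag set.
TRAIN_DATA = [
    ("Den Himmel ass blo.", {"tags": ["ART", "NN", "VAFIN", "ADJD", "$."]}),
]

nlp = spacy.load("de_core_news_sm")  # assumes the German model is installed
# Only update the tagger here – disable the other pipes so they stay untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # keep the existing weights (v2.1+)
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.3)

nlp.to_disk("post_trained_model")
```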
Remember that you always need evaluation data, too (also referred to as "development data" in the docs). This is usually a random portion of your labelled data that you hold back during training and use to determine whether your model is improving.
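In practice, this can be as simple as shuffling and slicing your examples. The 80/20 split here is just an illustration:

```python
import random

labelled_examples = [...]  # your full list of labelled (text, annotations) pairs

random.seed(0)  # make the split reproducible
random.shuffle(labelled_examples)
cutoff = int(len(labelled_examples) * 0.8)
train_data = labelled_examples[:cutoff]  # used for training
dev_data = labelled_examples[cutoff:]    # held back to check progress
```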