4

I don't understand why, when I do this:

import spacy
from copy import deepcopy
nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)
    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")

the tokens don't have any morph assigned:

print([tok.morph for tok in doc])
> ['','','','','']

Is this behavior expected? If yes, why? (spaCy v3.0.7)

Vee
  • 297
  • 1
  • 7
  • I don't know why that would happen, and I can't reproduce this in 3.2.0. – polm23 Mar 06 '22 at 05:39
  • Note that if you have an actual different tokenizer, the morphologizer could do something like this for tokens it's never seen because tokenization changed. – polm23 Mar 06 '22 at 05:39

1 Answer

3

The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy.

I admit I'm not entirely sure off the top of my head why you end up with empty analyses rather than more specific errors, but I think the MorphAnalysis objects, which are stored centrally in the vocab in vocab.morphology, end up out of sync between the two vocabs.
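The fix is to keep a reference to the original tokenizer instead of deep-copying it, so the wrapper's vocab remains the exact same Vocab object as nlp.vocab. A minimal sketch (using spacy.blank("fr") here so it runs without the fr_core_news_lg model installed; the same pattern applies to a loaded pipeline):

```python
import spacy

# a blank French pipeline stands in for fr_core_news_lg in this sketch
nlp = spacy.blank("fr")

class MyTokenizer:
    def __init__(self, tokenizer):
        # store the tokenizer as-is: no deepcopy, so its vocab is
        # still the very same object as nlp.vocab
        self.tokenizer = tokenizer

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)

# the pipeline's vocab and the wrapped tokenizer's vocab are identical
assert nlp.vocab is nlp.tokenizer.tokenizer.vocab

doc = nlp("Un texte en français.")
```

With the shared vocab in place, components like the morphologizer can look up the centrally stored analyses for the tokens the tokenizer produced.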

aab
  • 10,858
  • 22
  • 38