4

I don't understand why, when I do this:

import spacy
from copy import deepcopy
nlp = spacy.load("fr_core_news_lg")

class MyTokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = deepcopy(tokenizer)
    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)
doc = nlp("Un texte en français.")

the tokens don't have any morph assigned:

print([tok.morph for tok in doc])
> ['','','','','']

Is this behavior expected? If yes, why? (spaCy v3.0.7)

Vee
  • 297
  • 1
  • 7
  • I don't know why that would happen, and I can't reproduce this in 3.2.0. – polm23 Mar 06 '22 at 05:39
  • Note that if you have an actual different tokenizer, the morphologizer could do something like this for tokens it's never seen because tokenization changed. – polm23 Mar 06 '22 at 05:39

1 Answer

3

The pipeline expects nlp.vocab and nlp.tokenizer.vocab to refer to the exact same Vocab object, which isn't the case after running deepcopy.

I admit I'm not entirely sure off the top of my head why you end up with empty analyses rather than more specific errors, but I think the MorphAnalysis objects, which are stored centrally in the vocab in vocab.morphology, end up out of sync between the two vocabs.
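The fix is to keep a reference to the original tokenizer instead of deep-copying it, so the wrapper's vocab remains the exact same Vocab object as nlp.vocab. A minimal sketch (using spacy.blank("fr") here so it runs without the fr_core_news_lg model installed; the same pattern applies to a loaded pipeline):

```python
import spacy

# a blank French pipeline stands in for fr_core_news_lg in this sketch
nlp = spacy.blank("fr")

class MyTokenizer:
    def __init__(self, tokenizer):
        # store the tokenizer as-is: no deepcopy, so its vocab is
        # still the very same object as nlp.vocab
        self.tokenizer = tokenizer

    def __call__(self, text):
        return self.tokenizer(text)

nlp.tokenizer = MyTokenizer(nlp.tokenizer)

# the pipeline's vocab and the wrapped tokenizer's vocab are identical
assert nlp.vocab is nlp.tokenizer.tokenizer.vocab

doc = nlp("Un texte en français.")
```

With the shared vocab in place, components like the morphologizer can look up the centrally stored analyses for the tokens the tokenizer produced.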

aab
  • 10,858
  • 22
  • 38