I want to use spaCy's pretrained BERT model for text classification, but I'm a little confused about cased/uncased models. I read somewhere that cased models should only be used when there is a chance that letter casing will be helpful for the task. In my specific case, I am working with German texts, and in German all nouns start with a capital letter. So I think (correct me if I'm wrong) that this is exactly the situation where a cased model should be used. (There is also no uncased model available for German in spaCy.)

But what should be done with the data in this situation? While preprocessing the training data, should I leave it as it is (by that I mean not using the .lower() function), or does it make no difference?

Oleg Ivanytskyi

3 Answers

As a non-German speaker, your point about nouns being capitalized does make it seem like case is more relevant for German than it might be for English, but that doesn't necessarily mean a cased model will give better performance on all tasks.

For something like part-of-speech tagging, case would probably be enormously helpful for the reason you describe, but for something like sentiment analysis, it's less clear whether the added complexity of a much larger vocabulary is worth the benefit. (As a human, you could probably do sentiment analysis on all-lowercase text just as easily.)

Given that the only model available is the cased version, I would just go with that; I'm sure it will still be one of the best pretrained German models you can get your hands on. Cased models have separate vocabulary entries for differently cased words (e.g. in English, "the" and "The" are different tokens). So yes, during preprocessing you wouldn't want to throw that information away by calling .lower(); just leave the casing as-is, as the sketch below illustrates.
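
To see this concretely, here is a minimal sketch using the Hugging Face transformers library (which also backs spaCy's transformer pipelines). The model name bert-base-german-cased is just an assumed example of a cased German BERT:

    from transformers import AutoTokenizer

    # assumed model name; swap in whichever cased German model you actually use
    tok = AutoTokenizer.from_pretrained("bert-base-german-cased")

    # the cased vocabulary keeps the two surface forms apart
    print(tok.tokenize("Haus"))  # capitalized noun
    print(tok.tokenize("haus"))  # lowercased form tokenizes differently

If you call .lower() first, the model only ever sees the second variant, and the casing signal your task might benefit from is gone.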

jayelm

In simple terms, BERT cased does not lowercase words that start with a capital letter, such as nouns in German.

BERT cased is also helpful where accents play an important role, for example schön in German.

If schön is converted to schon by BERT uncased (which strips accents during normalization), the meaning changes: schön means beautiful, whereas schon means already.
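
As a rough sketch of what that normalization does (the helper name below is made up; it mimics the usual lowercase-then-strip-accents step of uncased BERT tokenizers):

    import unicodedata

    def uncased_normalize(text):
        # mimic uncased BERT preprocessing: lowercase, then drop
        # combining marks (category Mn) after NFD decomposition
        text = unicodedata.normalize("NFD", text.lower())
        return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

    print(uncased_normalize("schön"))  # -> "schon", a different German word

A cased model skips this step entirely, so schön and schon stay distinct.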

The difference between BERT cased and BERT uncased also shows up across different contexts. For example, in dialog systems users rarely type text in its correct form, so it is common to find words in lowercase. In that case, an uncased BERT may have an advantage.
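
As a rough illustration (again assuming bert-base-german-cased as the example model), you can compare how a cased tokenizer handles well-formed versus chat-style input:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-german-cased")

    print(tok.tokenize("Das Haus ist schön"))  # well-formed input
    print(tok.tokenize("das haus ist schön"))  # typical chat-style input
    # if the lowercase line splits into more or rarer subword pieces, the cased
    # model is seeing input unlike its pretraining data, and an uncased model
    # (where one exists) may cope better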

M_Bueno