I'm trying to load a huggingface model and tokenizer. This normally works really easily (I've done it with a dozen models):

from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = BertForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
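
For context, once the pair loads, a typical use is a fill-mask pipeline; here is a minimal sketch of that, with a made-up example sentence:

# Illustrative only: run masked-LM inference with the loaded model and tokenizer
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(unmasker("The patient was prescribed [MASK] for the infection."))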

But for some reason I'm getting an error when trying to load this one:

tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=False)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab

I found a related question, but it seems that was an issue in the git repo itself and not on Hugging Face. I checked the actual repo where this model is hosted on Hugging Face (link), and it clearly has a vocab file (PubMD-30k-clean.vocab) like the rest of the models I've loaded.
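
For reference, one way to double-check which files the repo actually contains is to list them programmatically (a minimal sketch using huggingface_hub's list_repo_files; it assumes the huggingface_hub package is installed):

from huggingface_hub import list_repo_files

# List the files hosted in the model repo; the vocab file
# (PubMD-30k-clean.vocab) should show up in this listing
print(list_repo_files("sultan/BioM-ALBERT-xxlarge"))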

Penguin

1 Answer

There seems to be some issue with the tokenizer. If you remove the use_fast parameter or set it to True, it works and you will be able to display the vocab:

tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=True)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab

Output:

{'intervention': 7062,
 '▁tongue': 6911,
 '▁kit': 8341,
 '▁biosimilar': 26423,
 'bank': 19880,
 '▁diesel': 20349,
 'SOD': 6245,
 'iri': 17739,
....
rbi
  • Without `use_fast=False` I get an error: `ImportError: AlbertConverter requires the protobuf library but it was not found in your environment. Checkout the instructions on the installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones that match your environment.`. With `use_fast=True` I get the same error – Penguin Aug 23 '22 at 19:30
  • Okay, found the solution to this [here](https://stackoverflow.com/questions/38680593/importerror-no-module-named-google-protobuf). Need to run `conda install protobuf` (a quick import check is sketched after these comments). – Penguin Aug 23 '22 at 19:36
  • Sorry, this error didn't occur for me on colab. Happy you solved it! – rbi Aug 23 '22 at 20:36
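
A quick way to confirm the protobuf dependency is actually importable before retrying the fast tokenizer (a minimal sketch; `pip install protobuf` is the assumed non-conda equivalent, and the runtime usually needs a restart after installing):

# Sanity check for the protobuf library that the fast ALBERT tokenizer conversion needs
try:
    import google.protobuf
    print("protobuf version:", google.protobuf.__version__)
except ImportError:
    # Install with `conda install protobuf` (or `pip install protobuf`),
    # then restart the runtime and retry AutoTokenizer.from_pretrained
    raise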