
No Language Left Behind (NLLB) is the machine translation model available at https://huggingface.co/facebook/nllb-200-distilled-600M.

It supports a fixed list of languages, but when I try to add a new language to the tokenizer, the following code runs without errors, yet the language token does not get added to the tokenizer object.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer.additional_special_tokens.append('aym_Latn')

print('aym_Latn' in tokenizer.additional_special_tokens)

tokenizer

[out]:

False

NllbTokenizerFast(name_or_path='facebook/nllb-200-distilled-600M', vocab_size=256204, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True), 
  'additional_special_tokens': ['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Beng', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn']}, clean_up_tokenization_spaces=True)

There are some suggested solutions at https://github.com/huggingface/tokenizers/issues/247, but note that if you simply override the additional special tokens, the original ones are lost, i.e.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer.add_special_tokens({'additional_special_tokens': ['aym_Latn']})

print('aym_Latn' in tokenizer.additional_special_tokens)
tokenizer

[out]:

True

NllbTokenizerFast(name_or_path='facebook/nllb-200-distilled-600M', vocab_size=256204, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True), 
  'additional_special_tokens': ['aym_Latn']}, clean_up_tokenization_spaces=True)

How do I add a new language to the NLLB tokenizer in Hugging Face?

My questions, broken into parts, are:

  • (part 1) How do I add the special tokens for new languages (without losing the languages the model was already trained on)?
  • (part 2) After adding the special tokens, are there additional steps to properly tokenize inputs, e.g. changing/setting the language-token assignment function?
  • (part 3) After adding the special tokens and any additional steps, when processing the inputs, should the special token be prepended to the raw string? Or is there a function in the NLLB tokenizer that adds it automatically when the tokenizer is initialized?

The desired goal is, after fine-tuning the model, to be able to do this, with the pipeline automatically handling the newly added language:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

translator = pipeline('translation', 
    model=model, tokenizer=tokenizer, 
    src_lang="aym_Latn", tgt_lang="spa_Latn", 
    max_length = 512
)

pipeline("Phisqha alwa pachaw sartapxta ukatx utaj jak’an 3 millas ukaruw muytir sarapxta.")

The pipeline approach might not be possible, since there may be some implicit function controlling how the tokenizer interacts with the language codes; in that case, at least this should work:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# In this case, how do we add the `src_lang` and `tgt_lang`?
text = "Phisqha alwa pachaw sartapxta ukatx utaj jak’an 3 millas ukaruw muytir sarapxta."

model.generate(**tokenizer([text], return_tensors="pt", padding=True))

In that case, how do we set the src_lang and tgt_lang?

alvas

1 Answer


Let's try to break these down into 3 separate questions...

Part 1: How to add the special tokens for new languages?

There are two separate issues here: appending to tokenizer.additional_special_tokens doesn't actually register the token with the tokenizer, and calling tokenizer.add_special_tokens doesn't append the new special tokens to the existing ones; it replaces them. To append, first retrieve the existing special tokens, add your new token to the list, and then call add_special_tokens with the updated list.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Keep the existing language codes, append the new one, then re-register them all.
new_special_tokens = tokenizer.additional_special_tokens + ['aym_Latn']
tokenizer.add_special_tokens({'additional_special_tokens': new_special_tokens})

print('aym_Latn' in tokenizer.additional_special_tokens)
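
If you plan to reuse the extended tokenizer, it can be saved and reloaded with the standard save_pretrained / from_pretrained calls, and the added code survives the round trip. A small sketch (the directory name is just an example):

tokenizer.save_pretrained("nllb-200-distilled-600M-aym")                # persist the extended tokenizer
reloaded = AutoTokenizer.from_pretrained("nllb-200-distilled-600M-aym")
print('aym_Latn' in reloaded.additional_special_tokens)                 # True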

Part 2: After adding the special tokens, are there additional steps to properly tokenize inputs?

Once the special tokens have been added, you don't need to do anything else to properly tokenize the inputs. The tokenizer will recognize the new special tokens and handle them appropriately during tokenization. Remember, though, that the model must be trained or fine-tuned before it can understand the new token.
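
One caveat that isn't spelled out above: the new language code gets a brand-new vocabulary ID, so the model's embedding matrix has to cover that ID before fine-tuning. A minimal sketch, assuming the tokenizer was extended as in Part 1 (some checkpoints already ship with a few spare embedding rows, hence the size check):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Grow the embedding matrix only if it doesn't already cover the enlarged vocabulary.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))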

Part 3: After adding the special tokens and any additional steps, when processing the inputs, should the special token be pre-pended in the raw string? Or is there a special function in NLLB tokenizer to automatically add it in when initializing the tokenizer?

For the NLLB model, you would need to prepend the special token to your input string.

input_string = "This is a test."
language_token = "aym_Latn"
# Prepend the language tag to the raw text before tokenizing.
tokenized_input = tokenizer(language_token + " " + input_string)
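
As a hedged aside (not part of the original answer): recent versions of the NLLB tokenizer also expose a src_lang attribute that adds the language token for you; whether it accepts a freshly added code depends on your transformers version, so treat this as a sketch to verify locally:

tokenizer.src_lang = "aym_Latn"   # may raise on older transformers versions
inputs = tokenizer("Phisqha alwa pachaw sartapxta.", return_tensors="pt")
# The newly added language token should now appear among the input IDs.
print(tokenizer.convert_tokens_to_ids("aym_Latn") in inputs["input_ids"][0].tolist())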

Edit to respond to OP Questions...

Won't new_special_tokens = tokenizer.additional_special_tokens + ['aym_Latn'] shift the vocab IDs? And would that affect the main tokenizer vocabs IDs? If not, there must be some sort of limit to how many tokens we can add, before it affects the main tokenizer's ID, right?

When you add a new token using tokenizer.add_tokens() or tokenizer.add_special_tokens(), the Hugging Face tokenizer does not shift the IDs of the existing vocabulary. It appends the new tokens to the end of the vocabulary.

If your original tokenizer had a vocabulary size of n and you add k new tokens, the new tokens will be assigned the IDs n, n+1, ..., n+k-1.
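
A quick way to verify this with standard tokenizer calls only (nothing is hard-coded, since the exact IDs depend on the checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
n_before = len(tokenizer)
id_eng_before = tokenizer.convert_tokens_to_ids("eng_Latn")

new_special_tokens = tokenizer.additional_special_tokens + ['aym_Latn']
tokenizer.add_special_tokens({'additional_special_tokens': new_special_tokens})

print(tokenizer.convert_tokens_to_ids("eng_Latn") == id_eng_before)  # True: existing IDs untouched
print(tokenizer.convert_tokens_to_ids("aym_Latn"))                   # a new ID appended at the end of the vocab
print(len(tokenizer) - n_before)                                     # 1: exactly one token added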

There's no hard limit to the number of tokens you can add, but there are a few considerations: memory, the model itself (its embedding matrix has to cover the enlarged vocabulary before the new IDs are usable), and performance (which is largely tied to memory).

Also, do we also prepend the target language tag to make sure the model reads the input properly and treats it as a special token?

Language tags can be used with some models to provide context, if it's needed. For some models (like mBERT), this doesn't matter. Adding tags won't automatically enable the model to understand a new language. However, in your case it may help; it's really use-case specific, tbh.
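
To make the target-language half of this concrete: with NLLB the target language is normally not prepended to the input text at all; it is passed at generation time as the forced first decoder token. A hedged sketch of that pattern (a checkpoint fine-tuned on the new source language, plus the source-language handling from Part 3, would be needed for this to actually translate Aymara):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

text = "Phisqha alwa pachaw sartapxta ukatx utaj jak’an 3 millas ukaruw muytir sarapxta."
inputs = tokenizer(text, return_tensors="pt")

# Select the target language by forcing its token as the first generated token.
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn"),
    max_length=512,
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))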

artemis
  • Won't `new_special_tokens = tokenizer.additional_special_tokens + ['aym_Latn']` shift the vocab IDs? And would that affect the main tokenizer vocabs IDs? If not, there must be some sort of limit to how many tokens we can add, before it affects the main tokenizer's ID, right? – alvas May 17 '23 at 04:12
  • Also, do we also prepend the target language tag to make sure the model reads the input properly and treats it as a special token? – alvas May 17 '23 at 04:35
  • I addressed your questions in my edit @alvas – artemis May 17 '23 at 15:09
  • I've granted you the bounty so that it doesn't go to waste =) But I think there are some bits that can be improved in terms of documenting the NLLB model and how it interacts with the tokenizer. I'll try to dig deeper and help improve the answer when I'm free. – alvas May 18 '23 at 11:59
  • Sure, let me know if you have any other questions, good luck :) @alvas – artemis May 19 '23 at 16:14