
I am using the RobertaTokenizerFast to tokenize some sentences and align them with annotations. I noticed an issue with certain non-ASCII characters:

from transformers import BatchEncoding, RobertaTokenizerFast
from tokenizers import Encoding

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
text = "Mr K. Strážnicke came to visit today."
annotations = [dict(start=0, end=16, text="Mr K. Strážnicke", label="MALE")]

tokenized_batch: BatchEncoding = tokenizer(text)
tokenized_text: Encoding = tokenized_batch[0]
tokens = tokenized_text.tokens
print(tokens)

returns ['<s>', 'Mr', 'ĠK', '.', 'ĠStr', 'á', 'Å', '¾', 'nic', 'ke', 'Ġcame', 'Ġto', 'Ġvisit', 'Ġtoday', '.', '</s>']
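
The split happens because 'ž' is two bytes in UTF-8 and the RoBERTa tokenizer works at the byte level, so a single character can be spread over two tokens. A quick check (sketch):

print("ž".encode("utf-8"))  # b'\xc5\xbe' -> the two bytes that surface above as the byte-level symbols 'Å' and '¾'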

If I align the tokens with the label, the span ends up cut in half at the '¾' token.

def align_tokens_and_annotations_bilou(tokenized: Encoding, annotations):
    tokens = tokenized.tokens
    # every token starts out as "O" (outside any annotation)
    aligned_labels = ["O"] * len(tokens)
    for anno in annotations:
        # collect the index of every token that any character of the span maps to
        annotation_token_ix_set = set()
        for char_ix in range(anno["start"], anno["end"]):
            token_ix = tokenized.char_to_token(char_ix)
            if token_ix is not None:
                annotation_token_ix_set.add(token_ix)
        for num, token_ix in enumerate(sorted(annotation_token_ix_set)):
            prefix = "B" if num == 0 else "I"
            aligned_labels[token_ix] = f"{prefix}-{anno['label']}"
    return aligned_labels

labels = align_tokens_and_annotations_bilou(tokenized_text, annotations)
for token, label in zip(tokens, labels):
    print(token, "-", label)

returns

<s> - O
Mr - B-MALE
ĠK - I-MALE
. - I-MALE
ĠStr - I-MALE
á - I-MALE
Å - I-MALE
¾ - O
nic - I-MALE
ke - I-MALE
Ġcame - O
Ġto - O
Ġvisit - O
Ġtoday - O
. - O
</s> - O
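
Checking char_to_token character by character (a small diagnostic sketch using the objects defined above) shows where the gap comes from: 'ž' is a single character, so char_to_token can only return one token index for it, and its second byte-level token ('¾') never makes it into the set.

for char_ix in range(annotations[0]["start"], annotations[0]["end"]):
    # e.g. the character 'ž' maps to the index of the 'Å' token only
    print(char_ix, repr(text[char_ix]), "->", tokenized_text.char_to_token(char_ix))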

I figured I could fix it by replacing the set of matched token indices with the full contiguous range between the first and last match:

annotation_token_ix_set = list(range(min(annotation_token_ix_set), max(annotation_token_ix_set) + 1))

which seems to work for the alignment, and I could train the model, but at test time, if similar tokens show up, the same gap could throw off the predictions. Any advice on this issue?
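
For reference, this is where that change sits inside align_tokens_and_annotations_bilou (a sketch of the workaround, placed right before the labeling loop):

    for anno in annotations:
        ...  # char_to_token collection as above
        if annotation_token_ix_set:
            # bridge the gap: keep every token index between the first and last match
            annotation_token_ix_set = list(
                range(min(annotation_token_ix_set), max(annotation_token_ix_set) + 1)
            )
        for num, token_ix in enumerate(sorted(annotation_token_ix_set)):
            ...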

Paschalis
  • Clean your data before feeding it into your model. Preprocessing is incredibly important for NLP applications. – Tim J Aug 30 '22 at 09:21
  • @TimJ I cannot remove spans with foreign characters as they are important to the task and it would be meaningless if I did. I was just wondering if this is known for tokenizing with RoBERTa and can be solved somehow – Paschalis Aug 30 '22 at 09:23
  • Why don't you only check `char_to_token(0)`, `char_to_token(15)` and align the whole range? – cronoik Sep 02 '22 at 00:03
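
A sketch of what that last comment suggests, assuming the annotation's first and last characters each map to a token (annotation_to_token_range is a hypothetical helper, not part of the code above):

def annotation_to_token_range(tokenized: Encoding, anno):
    # look up only the boundary characters and take every token in between
    first = tokenized.char_to_token(anno["start"])
    last = tokenized.char_to_token(anno["end"] - 1)
    if first is None or last is None:
        return []
    return list(range(first, last + 1))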
