
I am using the RobertaTokenizerFast to tokenize some sentences and align them with annotations. I noticed an issue with certain non-ASCII characters:

from transformers import BatchEncoding, RobertaTokenizerFast
from tokenizers import Encoding

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
text = "Mr K. Strážnicke came to visit today."
annotations = [dict(start=0, end=16, text="Mr K. Strážnicke", label="MALE")]

tokenized_batch: BatchEncoding = tokenizer(text)
tokenized_text: Encoding = tokenized_batch[0]
tokens = tokenized_text.tokens
print(tokens)

returns ['<s>', 'Mr', 'ĠK', '.', 'ĠStr', 'á', 'Å', '¾', 'nic', 'ke', 'Ġcame', 'Ġto', 'Ġvisit', 'Ġtoday', '.', '</s>']
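
The split happens because 'ž' is two bytes in UTF-8 and the RoBERTa tokenizer works at the byte level, so a single character can be spread over two tokens. A quick check (sketch):

print("ž".encode("utf-8"))  # b'\xc5\xbe' -> the two bytes that surface above as the byte-level symbols 'Å' and '¾'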

If I align the tokens with the label, the span ends up cut in half at the '¾' token.

def align_tokens_and_annotations_bilou(tokenized: Encoding, annotations):
    tokens = tokenized.tokens
    # every token starts out as "O" (outside any annotation)
    aligned_labels = ["O"] * len(tokens)
    for anno in annotations:
        # collect the index of every token that any character of the span maps to
        annotation_token_ix_set = set()
        for char_ix in range(anno["start"], anno["end"]):
            token_ix = tokenized.char_to_token(char_ix)
            if token_ix is not None:
                annotation_token_ix_set.add(token_ix)
        for num, token_ix in enumerate(sorted(annotation_token_ix_set)):
            prefix = "B" if num == 0 else "I"
            aligned_labels[token_ix] = f"{prefix}-{anno['label']}"
    return aligned_labels

labels = align_tokens_and_annotations_bilou(tokenized_text, annotations)
for token, label in zip(tokens, labels):
    print(token, "-", label)

returns

<s> - O
Mr - B-MALE
ĠK - I-MALE
. - I-MALE
ĠStr - I-MALE
á - I-MALE
Å - I-MALE
¾ - O
nic - I-MALE
ke - I-MALE
Ġcame - O
Ġto - O
Ġvisit - O
Ġtoday - O
. - O
</s> - O
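
Checking char_to_token character by character (a small diagnostic sketch using the objects defined above) shows where the gap comes from: 'ž' is a single character, so char_to_token can only return one token index for it, and its second byte-level token ('¾') never makes it into the set.

for char_ix in range(annotations[0]["start"], annotations[0]["end"]):
    # e.g. the character 'ž' maps to the index of the 'Å' token only
    print(char_ix, repr(text[char_ix]), "->", tokenized_text.char_to_token(char_ix))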

I figured I could fix it by replacing the set of matched token indices with the full contiguous range between the first and last match:

annotation_token_ix_set = list(range(min(annotation_token_ix_set), max(annotation_token_ix_set) + 1))

which seems to work for the alignment, and I could train the model, but at test time, if similar tokens show up, the same gap could throw off the predictions. Any advice on this issue?
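
For reference, this is where that change sits inside align_tokens_and_annotations_bilou (a sketch of the workaround, placed right before the labeling loop):

    for anno in annotations:
        ...  # char_to_token collection as above
        if annotation_token_ix_set:
            # bridge the gap: keep every token index between the first and last match
            annotation_token_ix_set = list(
                range(min(annotation_token_ix_set), max(annotation_token_ix_set) + 1)
            )
        for num, token_ix in enumerate(sorted(annotation_token_ix_set)):
            ...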

Paschalis
  • Clean your data before feeding it into your model. Preprocessing is incredibly important for NLP applications. – Tim J Aug 30 '22 at 09:21
  • @TimJ I cannot remove spans with foreign characters as they are important to the task and it would be meaningless if I did. I was just wondering if this is known for tokenizing with RoBERTa and can be solved somehow – Paschalis Aug 30 '22 at 09:23
  • Why don't you only check `char_to_token(0)`, `char_to_token(15)` and align the whole range? – cronoik Sep 02 '22 at 00:03
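
A sketch of what that last comment suggests, assuming the annotation's first and last characters each map to a token (annotation_to_token_range is a hypothetical helper, not part of the code above):

def annotation_to_token_range(tokenized: Encoding, anno):
    # look up only the boundary characters and take every token in between
    first = tokenized.char_to_token(anno["start"])
    last = tokenized.char_to_token(anno["end"] - 1)
    if first is None or last is None:
        return []
    return list(range(first, last + 1))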
