I am using the RobertaTokenizerFast to tokenize some sentences and align them with annotations. I noticed an issue with some characters:
from transformers import BatchEncoding, RobertaTokenizerFast
from tokenizers import Encoding
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
text = "Mr K. Strážnicke came to visit today."
annotations = [dict(start=0, end=16, text="Mr K. Strážnicke", label="MALE")]
tokenized_batch: BatchEncoding = tokenizer(text)
tokenized_text: Encoding = tokenized_batch[0]
tokens = tokenized_text.tokens
print(tokens)
This prints:
['<s>', 'Mr', 'ĠK', '.', 'ĠStr', 'á', 'Å', '¾', 'nic', 'ke', 'Ġcame', 'Ġto', 'Ġvisit', 'Ġtoday', '.', '</s>']
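I think I understand where the split comes from: RoBERTa uses byte-level BPE, so the tokenizer works on UTF-8 bytes rather than characters, and 'ž' encodes to two bytes. Each byte is displayed via a printable stand-in character, which is why the single character 'ž' shows up as the two tokens 'Å' (0xC5) and '¾' (0xBE). A minimal check:

# 'ž' is one character but two UTF-8 bytes, so byte-level BPE can
# split it across two tokens
print("ž".encode("utf-8"))   # b'\xc5\xbe'
print(chr(0xC5), chr(0xBE))  # Å ¾ -- the two tokens in the output above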
If I align the tokens with the annotation, the span ends up with a hole at the '¾' token:
def align_tokens_and_annotations_bilou(tokenized: Encoding, annotations):
    tokens = tokenized.tokens
    aligned_labels = ["O"] * len(tokens)  # default: outside any annotation
    for anno in annotations:
        # collect the index of every token covering a character of the annotation
        annotation_token_ix_set = set()
        for char_ix in range(anno["start"], anno["end"]):
            token_ix = tokenized.char_to_token(char_ix)
            if token_ix is not None:
                annotation_token_ix_set.add(token_ix)
        # the first matched token gets B, the rest get I
        for num, token_ix in enumerate(sorted(annotation_token_ix_set)):
            prefix = "B" if num == 0 else "I"
            aligned_labels[token_ix] = f"{prefix}-{anno['label']}"
    return aligned_labels
labels = align_tokens_and_annotations_bilou(tokenized_text, annotations)
for token, label in zip(tokens, labels):
print(token, "-", label)
This prints:
<s> - O
Mr - B-MALE
ĠK - I-MALE
. - I-MALE
ĠStr - I-MALE
á - I-MALE
Å - I-MALE
¾ - O
nic - I-MALE
ke - I-MALE
Ġcame - O
Ġto - O
Ġvisit - O
Ġtoday - O
. - O
</s> - O
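To confirm where the hole comes from, I looked up the 'ž' character directly; char_to_token maps a character index to a single token, so only one of the two byte tokens can ever be returned:

ix = text.index("ž")  # char index 10
# judging by the labels above, this returns the index of the 'Å' token;
# the '¾' token is never returned for any character, so it stays "O"
print(tokenized_text.char_to_token(ix))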
I figured I could fix it by filling in the gap between the first and last matched token, i.e. replacing annotation_token_ix_set with

token_ixs = sorted(annotation_token_ix_set)
annotation_token_ix_set = list(range(token_ixs[0], token_ixs[-1] + 1))

(the set has to be sorted into a list first, since sets don't support indexing). That seems to work for the alignment and I could train the model with it, but at test time, if similar tokens show up, the inconsistent labeling might mess with the predictions. Any advice on this issue?
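One alternative I'm considering is to align on each token's character offsets instead of char_to_token. A rough sketch, assuming Encoding.offsets gives per-token (start, end) character spans and that special tokens like <s> get an empty (0, 0) span:

def align_tokens_and_annotations_by_offsets(tokenized: Encoding, annotations):
    aligned_labels = ["O"] * len(tokenized.tokens)
    for anno in annotations:
        first = True
        for token_ix, (start, end) in enumerate(tokenized.offsets):
            if start == end:  # skip special tokens with empty spans
                continue
            # label every token whose character span overlaps the annotation;
            # both byte tokens of 'ž' should share the character span of 'ž',
            # so neither gets skipped
            if start < anno["end"] and end > anno["start"]:
                prefix = "B" if first else "I"
                aligned_labels[token_ix] = f"{prefix}-{anno['label']}"
                first = False
    return aligned_labels

Would overlap-based alignment like this be more robust than filling the gap by hand?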