Roberta transformer for ner gives index out of range error

Question

I have a function below that tokenizes and aligns my labels, but it is giving me an error:

def tokenize_and_align_labels(examples, label_all_tokens=True): 
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) 
    labels = [] 
    for i, label in enumerate(examples["ner_tags"]): 
        word_ids = tokenized_inputs.word_ids(batch_index=i) 
        # word_ids() => Return a list mapping the tokens
        # to their actual word in the initial sentence.
        # It Returns a list indicating the word corresponding to each token. 
        previous_word_idx = None 
        label_ids = []
        # Special tokens like `<s>` and `<\s>` are originally mapped to None 
        # We need to set the label to -100 so they are automatically ignored in the loss function.
        for word_idx in word_ids: 
            if word_idx is None: 
                # set –100 as the label for these special tokens
                label_ids.append(-100)
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            elif word_idx != previous_word_idx:
                # if current word_idx is != prev then its the most regular case
                # and add the corresponding token                 
                label_ids.append(label[word_idx]) 
            else: 
                # to take care of sub-words which have the same word_idx
                # set -100 as well for them, but only if label_all_tokens == False
                label_ids.append(label[word_idx] if label_all_tokens else -100) 
                # mask the subword representations after the first subword
                 
            previous_word_idx = word_idx 
        labels.append(label_ids) 
    tokenized_inputs["labels"] = labels 
    return tokenized_inputs

I located the line causing the error:

word_ids = tokenized_inputs.word_ids(batch_index=1)

This is the error produced:

My tokenized inputs, if run separately, without calling the function work fine as shown in the pic:

Can anyone please help me with the error? I spent 3 hours on this and it isn't working. Thanks!

For better explanation, here's the colab file too:https://colab.research.google.com/drive/1UJtc8TcuyCyFURKM1txYsqF1WKG_H6jZ#scrollTo=wc6AA6FMqDNq&uniqifier=1

score 0 · Answer 1 · answered Apr 11 '23 at 13:20

tokenize_and_align_labels expects to receive multiple examples, i.e. your examples["tokens"] should be a list of sentences, whereas right now it is a single sentence. If you take a look at the documentation, their example = wnut["train"][0] is a single sentence, which means their examples is a list of sentences.

Roberta transformer for ner gives index out of range error

1 Answers1