I'm trying to fine-tune a SegFormer model, pretrained with the MiT-b0 backbone, to perform semantic segmentation on images from the BDD100K dataset. Only a 10k subset of the 100k images has masks suitable for segmentation, where each pixel of a mask holds the label as a value between 0 and 18, or 255 for unknown labels. I'm also following this Colab example on a simple three-label segmentation.
The problem is that any further training I do on the training data ends up with a loss of NaN. Inspecting the predicted masks shows they contain NaN values as well, which is not right. I've made sure the input images for training are normalized, reduced the learning rate, increased the number of epochs, and changed the pretrained model, but I still end up with a NaN loss right away.
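One thing I plan to try next is TensorFlow's numeric checking, which should raise an error naming the first op that produces a NaN or Inf instead of letting it propagate silently into the loss (a debugging aid, not part of the pipeline itself):

import tensorflow as tf

# Raises an InvalidArgumentError identifying the first op that emits
# a NaN/Inf, in both eager and tf.function-compiled code.
tf.debugging.enable_check_numerics()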
I have my datasets as:
dataset = tf.data.Dataset.from_tensor_slices((image_train_paths, mask_train_paths))
val_dataset = tf.data.Dataset.from_tensor_slices((image_val_paths, mask_val_paths))
with this method to preprocess and normalize the data:
import tensorflow as tf
from tensorflow.keras import backend

height = 512
width = 512
mean = tf.constant([0.485, 0.456, 0.406])
std = tf.constant([0.229, 0.224, 0.225])

def normalize(input_image):
    input_image = tf.image.convert_image_dtype(input_image, tf.float32)
    input_image = (input_image - mean) / tf.maximum(std, backend.epsilon())
    return input_image
# Define a function to load and preprocess each example
def load_and_preprocess(image_path, mask_path):
    # Load the image and mask
    image = tf.image.decode_jpeg(tf.io.read_file(image_path), channels=3)
    mask = tf.image.decode_jpeg(tf.io.read_file(mask_path), channels=1)

    # Preprocess the image and mask
    image = tf.image.resize(image, (height, width))
    mask = tf.image.resize(mask, (height, width), method='nearest')
    image = normalize(image)
    mask = tf.squeeze(mask, axis=-1)
    image = tf.transpose(image, perm=(2, 0, 1))  # channels-first for the model
    return {'pixel_values': image, 'labels': mask}
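As a sanity check on the raw data (a quick sketch, assuming the mask path lists are already populated), I can decode a few masks directly and confirm the only pixel values present are 0-18 and 255:

for path in mask_train_paths[:5]:
    # decode_image handles both JPEG and PNG encodings
    mask = tf.io.decode_image(tf.io.read_file(path), channels=1)
    values, _ = tf.unique(tf.reshape(mask, [-1]))
    print(path, sorted(values.numpy().tolist()))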
Then actually creating the datasets:
batch_size = 4

train_dataset = (
    dataset
    .cache()
    .shuffle(batch_size * 10)
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

validation_dataset = (
    val_dataset
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)
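To rule out bad inputs before they ever reach the model, I can pull a single batch from this pipeline and check it (a minimal sketch; labels stay uint8 here because resize with method='nearest' preserves the dtype):

batch = next(iter(train_dataset))
print('pixel_values finite:', bool(tf.reduce_all(tf.math.is_finite(batch['pixel_values']))))
print('label min/max:', int(tf.reduce_min(batch['labels'])), int(tf.reduce_max(batch['labels'])))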
Setting up the labels and pre-trained model:
id2label = {
    0: 'road',
    1: 'sidewalk',
    2: 'building',
    3: 'wall',
    4: 'fence',
    5: 'pole',
    6: 'traffic light',
    7: 'traffic sign',
    8: 'vegetation',
    9: 'terrain',
    10: 'sky',
    11: 'person',
    12: 'rider',
    13: 'car',
    14: 'truck',
    15: 'bus',
    16: 'train',
    17: 'motorcycle',
    18: 'bicycle',
}
label2id = { label: id for id, label in id2label.items() }
num_labels = len(id2label)
from transformers import TFSegformerForSemanticSegmentation

model = TFSegformerForSemanticSegmentation.from_pretrained(
    'nvidia/mit-b0',
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001))
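One variation I've been meaning to try is gradient clipping, in case a single bad batch is blowing up the gradients (clipnorm is a standard Keras optimizer argument):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, clipnorm=1.0))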
Finally, fitting the data to the model, using only 1 epoch just to see if I can figure out why the loss is NaN:
epochs = 1
history = model.fit(train_dataset, validation_data=validation_dataset, epochs=epochs)
SegFormer implements its own loss function, so I don't need to supply one. I see the Colab example I was following has some sort of loss, but I can't figure out why mine is NaN.
Did I approach this correctly, or am I missing something along the way? What else can I try to figure out why the loss is NaN? I did also make sure the labels used match between the validation and training datasets. The mask pixel values range from 0 to 18, with 255 for unknown, as described in the docs.
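To separate "diverges during training" from "broken from the very first step", I can run the model eagerly on one batch and inspect its outputs directly (a sketch assuming the Hugging Face output exposes .loss and .logits, which TFSegformerForSemanticSegmentation does when labels are passed):

batch = next(iter(train_dataset))
outputs = model(pixel_values=batch['pixel_values'], labels=batch['labels'])
print('loss:', outputs.loss.numpy())
print('logits finite:', bool(tf.reduce_all(tf.math.is_finite(outputs.logits))))

If the loss is already NaN here with the untouched pretrained weights, the problem is in the data or the loss masking rather than in the optimization.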
Edit: 3/16
I did find this example, which pointed out some flaws in my approach, but even after following it in everything except how the dataset is gathered, I was still unable to produce any loss other than NaN.
My new code is mostly the same, other than how I am preprocessing the data with NumPy before converting it to tensors.
Dataset dict definition for training and validation data:
from datasets import Dataset, DatasetDict, Image

dataset = DatasetDict({
    'train': Dataset.from_dict({'pixel_values': image_train_paths, 'label': mask_train_paths})
        .cast_column('pixel_values', Image())
        .cast_column('label', Image()),
    'val': Dataset.from_dict({'pixel_values': image_val_paths, 'label': mask_val_paths})
        .cast_column('pixel_values', Image())
        .cast_column('label', Image()),
})
train_dataset = dataset['train']
val_dataset = dataset['val']
train_dataset.set_transform(preprocess)
val_dataset.set_transform(preprocess)
where preprocess processes the images using the AutoImageProcessor to get the inputs:
from transformers import AutoImageProcessor

# This is a SegformerImageProcessor
image_processor = AutoImageProcessor.from_pretrained('nvidia/mit-b0', semantic_loss_ignore_index=255)

def transforms(image):
    image = tf.keras.utils.img_to_array(image)
    image = image.transpose((2, 0, 1))  # vision models in transformers use channels-first layout
    return image

def preprocess(example_batch):
    images = [transforms(x.convert('RGB')) for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = image_processor(images, labels)
    return inputs
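To see what the processor actually produces, I can index one transformed example, since set_transform applies preprocess lazily on access (a sketch assuming the processor returns 'pixel_values' and 'labels' keys, which SegformerImageProcessor does when segmentation maps are passed):

import numpy as np

sample = train_dataset[0]  # runs preprocess on a batch of one
pixel_values = np.asarray(sample['pixel_values'])
labels = np.asarray(sample['labels'])
print(pixel_values.shape, pixel_values.dtype, 'finite:', np.isfinite(pixel_values).all())
print('unique labels:', np.unique(labels))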
Transforming the sets to TensorFlow datasets:
from transformers import DefaultDataCollator

batch_size = 4
data_collator = DefaultDataCollator(return_tensors="tf")

train_set = dataset['train'].to_tf_dataset(
    columns=['pixel_values', 'label'],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
val_set = dataset['val'].to_tf_dataset(
    columns=['pixel_values', 'label'],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
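Before fitting, I can confirm what actually survives collation, since the keys in each batch depend on the collator and the columns argument:

batch = next(iter(train_set))
for key, value in batch.items():
    print(key, value.shape, value.dtype)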
Fitting the model:
history = model.fit(
    train_set,
    validation_data=val_set,
    epochs=10,
)
1750/1750 [==============================] - ETA: 0s - loss: nan