I'm trying to fine-tune a SegFormer model, pretrained with the MiT-b0 backbone, to perform semantic segmentation on images from the BDD100K dataset. Only a 10k subset of the 100k images has masks suitable for segmentation, where each pixel of a mask holds the label as a value between 0 and 18, or 255 for unknown labels. I'm also following this Colab example on a simple three-label segmentation.
The problem is that any further training I do on the training data ends up with a loss of NaN. Inspecting the predicted masks shows they contain NaN values as well, which is not right. I've made sure the input images for training are normalized, reduced the learning rate, increased the number of epochs, and changed the pretrained model, but I still end up with a NaN loss right away.
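One thing I plan to try next is TensorFlow's numeric checking, which should raise an error naming the first op that produces a NaN or Inf instead of letting it propagate silently into the loss (a debugging aid, not part of the pipeline itself):

import tensorflow as tf

# Raises an InvalidArgumentError identifying the first op that emits
# a NaN/Inf, in both eager and tf.function-compiled code.
tf.debugging.enable_check_numerics()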
I have my datasets as:
dataset = tf.data.Dataset.from_tensor_slices((image_train_paths, mask_train_paths))
val_dataset = tf.data.Dataset.from_tensor_slices((image_val_paths, mask_val_paths))
with this method to preprocess and normalize the data:
import tensorflow as tf
from tensorflow.keras import backend

height = 512
width = 512
mean = tf.constant([0.485, 0.456, 0.406])
std = tf.constant([0.229, 0.224, 0.225])

def normalize(input_image):
    input_image = tf.image.convert_image_dtype(input_image, tf.float32)
    input_image = (input_image - mean) / tf.maximum(std, backend.epsilon())
    return input_image
# Define a function to load and preprocess each example
def load_and_preprocess(image_path, mask_path):
    # Load the image and mask
    image = tf.image.decode_jpeg(tf.io.read_file(image_path), channels=3)
    mask = tf.image.decode_jpeg(tf.io.read_file(mask_path), channels=1)

    # Preprocess the image and mask
    image = tf.image.resize(image, (height, width))
    mask = tf.image.resize(mask, (height, width), method='nearest')
    image = normalize(image)
    mask = tf.squeeze(mask, axis=-1)
    image = tf.transpose(image, perm=(2, 0, 1))  # channels-first for the model
    return {'pixel_values': image, 'labels': mask}
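As a sanity check on the raw data (a quick sketch, assuming the mask path lists are already populated), I can decode a few masks directly and confirm the only pixel values present are 0-18 and 255:

for path in mask_train_paths[:5]:
    # decode_image handles both JPEG and PNG encodings
    mask = tf.io.decode_image(tf.io.read_file(path), channels=1)
    values, _ = tf.unique(tf.reshape(mask, [-1]))
    print(path, sorted(values.numpy().tolist()))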
Then actually creating the datasets:
batch_size = 4

train_dataset = (
    dataset
    .cache()
    .shuffle(batch_size * 10)
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

validation_dataset = (
    val_dataset
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)
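To rule out bad inputs before they ever reach the model, I can pull a single batch from this pipeline and check it (a minimal sketch; labels stay uint8 here because resize with method='nearest' preserves the dtype):

batch = next(iter(train_dataset))
print('pixel_values finite:', bool(tf.reduce_all(tf.math.is_finite(batch['pixel_values']))))
print('label min/max:', int(tf.reduce_min(batch['labels'])), int(tf.reduce_max(batch['labels'])))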
Setting up the labels and pre-trained model:
id2label = {
    0: 'road',
    1: 'sidewalk',
    2: 'building',
    3: 'wall',
    4: 'fence',
    5: 'pole',
    6: 'traffic light',
    7: 'traffic sign',
    8: 'vegetation',
    9: 'terrain',
    10: 'sky',
    11: 'person',
    12: 'rider',
    13: 'car',
    14: 'truck',
    15: 'bus',
    16: 'train',
    17: 'motorcycle',
    18: 'bicycle',
}
label2id = { label: id for id, label in id2label.items() }
num_labels = len(id2label)
from transformers import TFSegformerForSemanticSegmentation

model = TFSegformerForSemanticSegmentation.from_pretrained(
    'nvidia/mit-b0',
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001))
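One variation I've been meaning to try is gradient clipping, in case a single bad batch is blowing up the gradients (clipnorm is a standard Keras optimizer argument):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, clipnorm=1.0))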
Finally, fitting the data to the model, using only 1 epoch just to see if I can figure out why the loss is NaN:
epochs = 1
history = model.fit(train_dataset, validation_data=validation_dataset, epochs=epochs)
SegFormer implements its own loss function, so I don't need to supply one. I see the Colab example I was following has some sort of loss, but I can't figure out why mine is NaN.
Did I approach this correctly, or am I missing something along the way? What else can I try to figure out why the loss is NaN? I did also make sure the labels used match between the validation and training datasets. The mask pixel values range from 0 to 18, with 255 for unknown, as described in the docs.
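To separate "diverges during training" from "broken from the very first step", I can run the model eagerly on one batch and inspect its outputs directly (a sketch assuming the Hugging Face output exposes .loss and .logits, which TFSegformerForSemanticSegmentation does when labels are passed):

batch = next(iter(train_dataset))
outputs = model(pixel_values=batch['pixel_values'], labels=batch['labels'])
print('loss:', outputs.loss.numpy())
print('logits finite:', bool(tf.reduce_all(tf.math.is_finite(outputs.logits))))

If the loss is already NaN here with the untouched pretrained weights, the problem is in the data or the loss masking rather than in the optimization.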
Edit: 3/16
I did find this example, which pointed out some flaws in my approach, but even after following it in everything except how the dataset is gathered, I was still unable to produce any loss other than NaN.
My new code is mostly the same, other than how I am preprocessing the data with NumPy before converting it to tensors.
Dataset dict definition for training and validation data:
from datasets import Dataset, DatasetDict, Image

dataset = DatasetDict({
    'train': Dataset.from_dict({'pixel_values': image_train_paths, 'label': mask_train_paths})
        .cast_column('pixel_values', Image())
        .cast_column('label', Image()),
    'val': Dataset.from_dict({'pixel_values': image_val_paths, 'label': mask_val_paths})
        .cast_column('pixel_values', Image())
        .cast_column('label', Image()),
})
train_dataset = dataset['train']
val_dataset = dataset['val']
train_dataset.set_transform(preprocess)
val_dataset.set_transform(preprocess)
where preprocess processes the images using the AutoImageProcessor to get the inputs:
from transformers import AutoImageProcessor

# This is a SegformerImageProcessor
image_processor = AutoImageProcessor.from_pretrained('nvidia/mit-b0', semantic_loss_ignore_index=255)

def transforms(image):
    image = tf.keras.utils.img_to_array(image)
    image = image.transpose((2, 0, 1))  # vision models in transformers use channels-first layout
    return image

def preprocess(example_batch):
    images = [transforms(x.convert('RGB')) for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = image_processor(images, labels)
    return inputs
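To see what the processor actually produces, I can index one transformed example, since set_transform applies preprocess lazily on access (a sketch assuming the processor returns 'pixel_values' and 'labels' keys, which SegformerImageProcessor does when segmentation maps are passed):

import numpy as np

sample = train_dataset[0]  # runs preprocess on a batch of one
pixel_values = np.asarray(sample['pixel_values'])
labels = np.asarray(sample['labels'])
print(pixel_values.shape, pixel_values.dtype, 'finite:', np.isfinite(pixel_values).all())
print('unique labels:', np.unique(labels))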
Transforming the sets to TensorFlow datasets:
from transformers import DefaultDataCollator

batch_size = 4
data_collator = DefaultDataCollator(return_tensors="tf")

train_set = dataset['train'].to_tf_dataset(
    columns=['pixel_values', 'label'],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
val_set = dataset['val'].to_tf_dataset(
    columns=['pixel_values', 'label'],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
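Before fitting, I can confirm what actually survives collation, since the keys in each batch depend on the collator and the columns argument:

batch = next(iter(train_set))
for key, value in batch.items():
    print(key, value.shape, value.dtype)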
Fitting the model:
history = model.fit(
    train_set,
    validation_data=val_set,
    epochs=10,
)
1750/1750 [==============================] - ETA: 0s - loss: nan