
I am trying to create a neural network that can classify an image as a rectangle or a circle and find its bounding box, for learning purposes.

For this I am using tensorflow.keras on an artificially created training set of images.

The main problem is that the loss goes to nonsensical values (NaN) and plateaus there, and I have no clue what's going wrong here.

I tried adding/removing layers and neurons, and also changed the optimizer and loss, but there was no improvement, so kindly suggest what to try.

Main code snippets are given below:

import numpy as np
import cv2
from random import choice, randrange

# Build 1000 training images, each with a single filled rectangle or circle
blank = 255 * np.ones(shape=(200, 200, 1), dtype='uint8')
shapes = ['rect', 'circle']
inputShape = blank.shape
x_train, y_train = [], []
for i in range(1000):
    img = blank.copy()
    shapeChoice = choice(shapes)
    if shapeChoice == 'rect':
        x1, y1 = randrange(50, 100), randrange(50, 100)
        x3, y3 = x1 + randrange(10, 50), y1 + randrange(10, 50)
        # Center, width and height of the box (w = x3 - x1, h = y3 - y1)
        xc, yc, w, h = x1/2 + x3/2, y1/2 + y3/2, x3 - x1, y3 - y1
        img = cv2.rectangle(img, (x1, y1), (x3, y3), 0, -1)
        y_train.append([1, 0, xc, yc, w, h])
    if shapeChoice == 'circle':
        xc, yc, R = randrange(50, 100), randrange(50, 100), randrange(10, 50)
        img = cv2.circle(img, (xc, yc), R, 0, -1)
        y_train.append([0, 1, xc, yc, R, R])
    x_train.append(img)

x_train = np.array(x_train, dtype='uint8')
y_train = np.array(y_train, dtype='uint8')
x_train = x_train / 255
outputUnits = sum(y_train[0].shape)  # 6: [p_rect, p_circle, xc, yc, w, h]
from tensorflow.keras import layers, models, optimizers

cnn = models.Sequential([
    layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', input_shape=inputShape),
    layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu'),
    layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu'),
    layers.Conv2D(filters=8, kernel_size=(3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(filters=4, kernel_size=(2, 2), activation='relu'),
    layers.MaxPooling2D((2, 2)),    
    layers.Flatten(),
    layers.Dense(16*outputUnits, activation='relu'),
    layers.Dense(8*outputUnits, activation='relu'),
    layers.Dense(4*outputUnits, activation='relu'),
    layers.Dense(2*outputUnits, activation='relu'),
    layers.Dense(outputUnits, activation='softmax')])

cnn.compile(optimizer=optimizers.Adam(learning_rate=0.1, beta_1=0.9, beta_2=0.99),
            loss='categorical_crossentropy',
            metrics=['accuracy'])

history = cnn.fit(x_train, y_train, epochs=5, batch_size=5)

The output of training epochs:

Epoch 1/5
200/200 [==============================] - 22s 15ms/step - loss: nan - accuracy: 0.0510
Epoch 2/5
200/200 [==============================] - 3s 15ms/step - loss: nan - accuracy: 0.0000e+00
Epoch 3/5
200/200 [==============================] - 3s 16ms/step - loss: nan - accuracy: 0.0000e+00
Epoch 4/5
200/200 [==============================] - 3s 15ms/step - loss: nan - accuracy: 0.0000e+00
Epoch 5/5
200/200 [==============================] - 3s 15ms/step - loss: nan - accuracy: 0.0000e+00

1 Answer

I'm not really an expert in the field, but I can identify some problems with your approach that may be the reason your model is not performing as expected. I'll try to explain them in the following sections.

Labels

First of all, you are defining your labels in a weird way. For the rectangles, the labels are [prob_rect, prob_circle, center_x, center_y, width, height], while for the circles they are [prob_rect, prob_circle, center_x, center_y, radius, radius]. The rectangles are fine, but you should use the diameter instead of the radius for the circles (otherwise they won't fit in the box).
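For example, the circle branch of your generation loop could store the diameter, so the box actually encloses the circle:

xc, yc, R = randrange(50, 100), randrange(50, 100), randrange(10, 50)
img = cv2.circle(img, (xc, yc), R, 0, -1)
y_train.append([0, 1, xc, yc, 2 * R, 2 * R])  # width = height = diameter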

Loss function

You are using raw categorical cross-entropy, which is a staple when you are doing image classification, but that is not what we are doing here.

The categorical cross-entropy expects to deal with class probabilities and labels encoded in a one-hot representation. However, your labels (y_train) have a bunch of other stuff in them (like coordinates, width, height...) and the cross-entropy has no idea how to make sense of those things.

This loss function can be used, but not raw. You have to customize it to fit your needs.
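As a rough sketch (the function name is just for illustration), assuming, as in your labels, that the first two entries are the one-hot class and the last four are the box, a custom loss could split the tensors and combine a classification term with a regression term:

import tensorflow as tf

def class_plus_box_loss(y_true, y_pred):  # illustrative name
    # First two entries: one-hot class; last four: box (xc, yc, w, h)
    class_true, box_true = y_true[:, :2], y_true[:, 2:]
    class_pred, box_pred = y_pred[:, :2], y_pred[:, 2:]
    class_loss = tf.keras.losses.categorical_crossentropy(class_true, class_pred)
    box_loss = tf.keras.losses.mse(box_true, box_pred)
    return class_loss + box_loss  # a plain sum; the two terms could also be weighted

Note that this only makes sense if the final layer no longer applies a softmax across all six outputs.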

Architecture

The traditional way to do object detection involves two steps:

  1. Extract features with a feature extractor (the "backbone"; normally the feature extractor is an adapted image classifier, and in your network the convolutional layers essentially play this role)
  2. Use the features to (somehow) predict the bounding boxes and class probabilities (after all we want to draw a box AND know the class of whatever is inside the box)

Step number 1 is the reason why advancements in image classification translate more or less directly to object detection. Step number 2 is the reason object detection is hard to make sense of.

Fortunately in this case we're not dealing with object detection. We are dealing with classification + localization, which is basically a restricted version of object detection. The restriction being: our images are guaranteed to have one and only one "classifiable element", which means we'll always have 1 class and we'll always have 1 bounding box. This rarely happens in object detection.

The network you defined is currently trying to do everything (predict class probabilities and bounding boxes coordinates) at once, but it has no hope of succeeding because of what was discussed in the Loss function section. If the loss function doesn't make any sense, the network has no idea if it is doing the right thing.

One of the ways (not the only way) you can accomplish step 2 is to not do everything at once. You define two "heads":

  • One that tries to predict class probabilities, using a loss function that works with probabilities (crossentropy, softmax loss...)
  • One that predicts the bounding box coordinates, using a loss function that works well with regression (MSE, MAE, ...)

Your total loss will be some kind of combination (like the sum) of both losses.

The probabilities head can be a fully connected layer with two outputs (one for each class), and the box coordinates head can be a fully connected layer with four outputs (one for each: center_x, center_y, width, height).

Those heads would have as input one of the backbone layers or one of the fully connected layers that comes after the convolutional layers. Your final network output then would be a combination of the output of both heads.
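A minimal sketch of this idea with the Keras functional API (layer sizes here are illustrative, not tuned):

from tensorflow.keras import layers, models, optimizers

inputs = layers.Input(shape=(200, 200, 1))
x = layers.Conv2D(32, (3, 3), activation='relu')(inputs)  # backbone
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(16, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation='relu')(x)

# Head 1: class probabilities (softmax makes sense here)
class_out = layers.Dense(2, activation='softmax', name='class')(x)
# Head 2: box coordinates (linear output, treated as regression)
box_out = layers.Dense(4, activation='linear', name='box')(x)

model = models.Model(inputs=inputs, outputs=[class_out, box_out])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss={'class': 'categorical_crossentropy', 'box': 'mse'},
              loss_weights={'class': 1.0, 'box': 1.0})

# The labels are then split to match the two heads:
# model.fit(x_train, {'class': y_train[:, :2], 'box': y_train[:, 2:]}, ...)

Keras sums the weighted per-head losses into the total loss, which is exactly the kind of combination described above.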

Metrics

You specified accuracy as the performance metric. In the context of your problem, how do you calculate or interpret accuracy? It's complicated. What is a "hit" and what is a "miss"? Aren't most cases just partial hits/misses?

I won't get into too much detail (none at all), but there are other metrics that make more sense in the context of this problem. One of the most popular is mAP (Mean Average Precision).
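For intuition, the building block behind mAP is IoU (Intersection over Union), which turns those "partial hits" into a number between 0 and 1. A rough sketch for boxes given as (center_x, center_y, width, height):

def iou(box_a, box_b):
    # Convert (xc, yc, w, h) to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Area of the overlap rectangle (zero if the boxes don't intersect)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0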

  • Sounds convincing, let me try. But just to confirm: are you suggesting we create 2 models (one for the class and a second for the bounding box), with the first loss as categorical cross-entropy and the second loss as MSE? – Vishnu Balaji May 17 '23 at 03:03
  • You would still only have one model, but your model would now have a "bifurcation": it would go from --- to --<, something like that. This means you won't be able to implement the entire model with a single call to `keras.Sequential`; you may need to take a look at the [functional API](https://www.tensorflow.org/guide/keras/functional). You can also add a layer to concatenate the results from both heads at the end. For the loss you can [define a custom one](https://stackoverflow.com/a/65246143/20977133) that takes `y_true` and `y_pred` and interprets the values correctly. – CarlosGDCJ May 17 '23 at 12:08