
In the Mask R-CNN model I replaced the Lambda layer below with a custom layer. The model compiles, but it does not train on the GPU: it seems to stop at Epoch 1, right before allocating the workers. I am not certain what I am doing wrong. The Lambda layer does not allow me to save the model, so I have to use a different approach, but it is not working. Any help is appreciated.

gt_boxes = KL.Lambda(lambda x: norm_boxes_graph(x, K.shape(input_image)[1:3]))(input_gt_boxes)

with this custom layer:

gt_boxes = GtBoxesLayer(name='lambda_get_norm_boxes')([input_image, input_gt_boxes])

The layer code is:

class GtBoxesLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(GtBoxesLayer, self).__init__(**kwargs)

    def call(self, inputs):
        # inputs[0] is the input image, inputs[1] the ground-truth boxes
        return norm_boxes_graph(inputs[1], get_shape_image(inputs[0]))

    def get_config(self):
        config = super(GtBoxesLayer, self).get_config()
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)



def norm_boxes_graph(boxes, shape):
    """Converts boxes from pixel coordinates to normalized coordinates.
    boxes: [..., (y1, x1, y2, x2)] in pixel coordinates
    shape: [..., (height, width)] in pixels

    Note: In pixel coordinates (y2, x2) is outside the box. But in normalized
    coordinates it's inside the box.

    Returns:
        [..., (y1, x1, y2, x2)] in normalized coordinates
    """
    h, w = tf.split(tf.cast(shape, tf.float32), 2)
    scale = tf.concat([h, w, h, w], axis=-1) - tf.constant(1.0)
    shift = tf.constant([0., 0., 1., 1.])
    return tf.divide(boxes - shift, scale)

def get_shape_image(input_image):
    # Dynamic (height, width) of the batched image tensor
    return tf.shape(input_image)[1:3]
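
For reference, a minimal sketch of how a model containing this layer could be saved and reloaded once get_config() is in place; the `model` variable and file name below are placeholders, not part of the actual Mask R-CNN code:

# Hypothetical example: `model` is a compiled tf.keras model built with GtBoxesLayer.
model.save('mask_rcnn_custom.h5')

# Reloading requires passing the custom class via custom_objects.
reloaded = tf.keras.models.load_model(
    'mask_rcnn_custom.h5',
    custom_objects={'GtBoxesLayer': GtBoxesLayer},
    compile=False)  # recompile manually if further training is needed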


Mihai.Mehe
  • An observation: I am not exceeding my GPU resources. I am monitoring the resources with **glance** and it seems that the CPU% goes down, the memory stays at 8%, and the GPU% is 0. I can see 21 new processes being opened, which is expected for my 20 workers, but the resources are not used. Usually I see them each working at 100%, but now nothing happens. – Mihai.Mehe Jan 12 '23 at 04:34

1 Answer


The smaller example and answer I wrote for one custom layer show how to customize the layers for this model. In addition, this model needs all the available multiprocessing units (workers = 40 in my case). I set save_weights_only=True and layers='heads', and enabled memory growth with tf.config.experimental.set_memory_growth(gpus[gpu_index], True). Each of the original custom layers in this model must implement get_config() to be serialized properly. Training and saving the full model worked afterwards.
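
A rough sketch of these settings is below. The variable names (gpus, gpu_index, model, config, the datasets) are placeholders; model.train(..., layers='heads', custom_callbacks=...) follows the Matterport signature, and the number of workers is set inside the repo's model.py rather than here.

import tensorflow as tf

# Let the GPU allocate memory as needed instead of grabbing it all up front.
gpus = tf.config.list_physical_devices('GPU')
gpu_index = 0  # placeholder: index of the GPU to use
tf.config.experimental.set_memory_growth(gpus[gpu_index], True)

# Checkpoint only the weights during training.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'mask_rcnn_weights.h5', save_weights_only=True)

# Train only the head layers; workers/use_multiprocessing are handled by the
# fit call inside the repo's model.py (40 workers in my case).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=10,
            layers='heads',
            custom_callbacks=[checkpoint])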

Mihai.Mehe
  • I have a [similar problem](https://stackoverflow.com/questions/75133574/use-multiprocessing-true-in-mask-rcnn-with-keras-2-x-tensorflow-2-x) to yours. Did you use the original Matterport Mask R-CNN as a base, or did you have more changes (apart from the ones needed for compatibility)? Thanks! – ArieAI Jan 17 '23 at 08:35
  • Yes, I used the original and added these changes. The custom layers allowed me to save the complete model and avoid the "can't pickle RLock objects" errors. After solving that, I still could not train it. Smaller examples worked while the complete model was still not training, until I did all of the above. – Mihai.Mehe Jan 17 '23 at 15:57
  • Could you please share the exact changes and their locations? (Are they all in the model.py file?) I don't fully understand why a custom layer is needed, and I don't fully follow how to reproduce your changes. – ArieAI Jan 19 '23 at 16:34
  • E.g., where exactly should this line go? `tf.config.experimental.set_memory_growth(gpus[gpu_index], True)` Thanks. – ArieAI Jan 19 '23 at 16:41
  • at the beginning, before running the model – Mihai.Mehe Jan 20 '23 at 01:43
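
For example, a sketch of the placement described in the last comment; the file layout is an assumption, and the key point is calling set_memory_growth before the Mask R-CNN model is built:

# At the very top of the training script, before building the model.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)

# ... only afterwards construct the Mask R-CNN model and call model.train(...)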