18

While running a Kubeflow pipeline whose code uses TensorFlow 2.0, the error below is displayed at the end of each epoch:

W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Also, after some epochs, it stops showing logs and reports this error:

This step is in Failed state with this message: The node was low on resource: memory. Container main was using 100213872Ki, which exceeds its request of 0. Container wait was using 25056Ki, which exceeds its request of 0.

Radhi

7 Answers

5

In my case, batch_size and steps_per_epoch did not match.

For example,

his = Test_model.fit_generator(datagen.flow(trainrancrop_images, trainrancrop_labels, batch_size=batchsize),
                               steps_per_epoch=len(trainrancrop_images)/batchsize,
                               validation_data=(test_images, test_labels),
                               epochs=1,
                               callbacks=[callback])

The batch_size passed to datagen.flow must correspond to the steps_per_epoch passed to Test_model.fit_generator (in my case, I had used the wrong value for steps_per_epoch).

I guess this is only one of the possible causes of the error.

In short, I think the problem arises when the batch size and the number of steps (iterations) per epoch do not correspond to each other.

A float may also be a problem: dividing the number of samples by the batch size with / gives a float, not an integer step count.

Check your code for this issue.
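As a rough illustration of the point about floats, here is a minimal sketch of one way to get an integer step count. The helper name `steps_per_epoch` and the sample sizes are hypothetical, and rounding up assumes the generator yields a smaller final batch, which may not hold for every data source.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Return how many batches are needed to cover num_samples once.

    Hypothetical helper: rounds up, so a final partial batch still
    counts as one step.
    """
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    return math.ceil(num_samples / batch_size)


# Illustrative sizes only: 1000 samples with batch size 32
# give 31 full batches plus one partial batch of 8 samples.
print(steps_per_epoch(1000, 32))  # 32
```

In the call above, this would presumably be passed as `steps_per_epoch=steps_per_epoch(len(trainrancrop_images), batchsize)` in place of the plain division.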

Good luck :)

Ynjxsjmh
5

Upgrading TensorFlow from 2.1 to 2.2 fixed this issue for me. I didn't have to switch to the tf-nightly version.

Safwan
  • Upgraded TensorFlow 2.1 to TensorFlow 2.2 and this issue is gone for me. – user3284804 Jul 02 '20 at 23:37
  • @user3284804 - Please consider upvoting if this answer helped you. Thanks. – Safwan Jul 03 '20 at 05:10
  • I am running tensorflow-gpu in a conda env and it keeps installing version 2.1. If I try to upgrade it using `pip3 install --upgrade tensorflow-gpu`, I can't use it anymore. Does anyone know how to upgrade tensorflow-gpu inside an env? – Dhouibi iheb Sep 07 '20 at 01:23
  • @Dhouibiiheb What do you mean by you cannot use it anymore? – Safwan Sep 08 '20 at 03:38
  • @Safwan Meaning that when I try `pip install --upgrade tensorflow==2.2` (or 2.3), tensorflow won't work anymore. As far as I know, conda env supports tf 2.1 for now, not sure though. – Dhouibi iheb Sep 09 '20 at 05:10
  • @Dhouibiiheb `conda` supports tf2.2 now. Use `conda install -c anaconda tensorflow-gpu` to install tf2.2 – Safwan Sep 09 '20 at 07:45
  • @Safwan I tried it already; nothing changes, it won't update tf to tf2.2. – Dhouibi iheb Sep 10 '20 at 03:13
3

This was due to incompatible CUDA and TensorFlow versions. The versions below work well with each other:

tensorflow-gpu==2.0.0

tensorflow-addons==0.6.0

nvidia/cuda:10.0-cudnn7-runtime

Community
Radhi
1

I have the same problem. People claim that the warning is superfluous and that it has been removed in tf-nightly, see here. But the memory leak is still there for each epoch.

MH Yip
0

In my case, I installed tf-nightly and now it's working, though I am new to TensorFlow. I followed this link

You can try.

Shantanu Nath
0

To fix the problem you can add workers=1 in model.fit(...).

0

I tried the following steps and they worked in my case:

conda install tensorflow=2.0.0
conda install -c conda-forge keras=2.3.0