
Machine learning newbie here. I'm training a fairly simple model from a tutorial, using the fashion_mnist dataset, on Windows 10. The training process takes extremely long and I never even let it finish, but when I ran the same code on my friend's Linux system it took less than a minute.

I tried to examine the problem, but the setup and environment on my computer seem fine:

import tensorflow as tf 
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
print(tf.test.is_built_with_cuda())

With the following output:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13701120911614314629
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3061212774
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7589776483736281928
physical_device_desc: "device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5"
]
True

But the problem is that nvidia-smi shows almost 0% GPU utilization while GPU memory usage is high:


C:\Users\Herr LU>nvidia-smi
Mon Apr 06 16:36:53 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 442.19       Driver Version: 442.19       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   64C    P0    18W /  N/A |   3256MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     22728      C   ...al\Programs\Python\Python37\pythonw.exe N/A      |
+-----------------------------------------------------------------------------+

C:\Users\Herr LU>
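
(For what it's worth, TensorFlow reserves most of the available GPU memory up front by default, so the high memory number by itself doesn't mean the GPU is actually doing work. Allocation can be made to grow on demand instead; the snippet below is only a reference sketch for TF 2.x and is not part of my training script.)

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving almost all of it at
# startup; this must run before any GPU operation is executed.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)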

Here is the code:

# clothes/shoes recognition with Fashion-MNIST
import tensorflow as tf
from tensorflow import keras

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # make only the first GPU visible to TensorFlow

# the Fashion-MNIST clothing dataset bundled with Keras
mnist = keras.datasets.fashion_mnist

# separate training data and test data, which is already done in the dataset
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

import matplotlib.pyplot as plt

# show the image array as a picture (cmap = colormap)
#plt.imshow(training_images[0])
#print(training_labels[0])
#print(training_images[0])

with tf.device('/device:gpu:0'):
    # normalize the pixel values to 0~1
    training_images = training_images/255.0
    test_images = test_images/255.0

    # Build a model
    model = keras.Sequential([keras.layers.Flatten(),
                              keras.layers.Dense(128, activation=tf.nn.relu),
                              keras.layers.Dense(10, activation=tf.nn.softmax)])

    # Compile the model with an optimizer and a loss function
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    #train the model with data
    model.fit(training_images, training_labels, epochs=5)

    #evaluate the model
    model.evaluate(test_images, test_labels)

What should I do to solve this problem?

  • Have you checked this question? https://stackoverflow.com/questions/46080634/keras-with-tensorflow-backend-not-using-gpu I think it may be an installation issue where Keras sees the GPU but only runs on the CPU. – gnahum Apr 06 '20 at 21:09
  • The model is very simple; GPUs have much more compute than you can imagine. You can increase utilization by increasing the batch size (more parallelism; a one-line example follows after this comment thread). – Dr. Snoopy Apr 06 '20 at 21:47
  • I'm sure I didn't install plain tensorflow; there's only tensorflow-gpu 2.0.0, so the link above couldn't solve my problem. – PokeLu Apr 06 '20 at 22:31
  • Can you check whether TensorFlow is using the GPU in the session? Like this: https://stackoverflow.com/questions/45662253/can-i-run-keras-model-on-gpu – gnahum Apr 06 '20 at 22:58
  • @gnahum I followed every step in the link, and there was absolutely no problem: the available GPU was shown, and the matrix multiplication below it printed `tf.Tensor([[22. 28.] [49. 64.]], shape=(2, 2), dtype=float32)`. It didn't show which device the program was running on, though, even though I used tf.debugging.set_log_device_placement(True). – PokeLu Apr 07 '20 at 09:25
  • Can you try this workaround I found: 1. create a state_dict of the weights like in PyTorch, 2. get the model architecture as JSON, 3. clear the Keras session and delete the model instance, 4. create a new model from the JSON inside a tf.device context and load the previous weights from the state_dict (see https://stackoverflow.com/questions/59616788/how-to-move-a-tensorflow-keras-model-to-gpu; a rough sketch of these steps follows after this comment thread). – gnahum Apr 07 '20 at 18:12
  • @gnahum Thank you so much for your help! I finally solved it with Miniconda. It turns out there must have been something wrong with how I set up the CUDA and cuDNN environment, which I couldn't find myself. – PokeLu Apr 10 '20 at 21:50
  • Glad I could help!! – gnahum Apr 10 '20 at 22:11
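
For reference, the batch-size change Dr. Snoopy suggests is a one-line edit to the fit call in the script above; the value 512 is only an illustrative choice, not something recommended in the thread:

# A larger batch gives the GPU more work per step, which raises utilization.
# Keras defaults to batch_size=32; 512 below is purely illustrative.
model.fit(training_images, training_labels, epochs=5, batch_size=512)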
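And a rough sketch of the workaround gnahum links to, continuing from the model variable in the script above and assuming TensorFlow 2.x Keras (the names weights_state and model_json are illustrative, not taken from the linked answer):

import tensorflow as tf
from tensorflow import keras

# 1. Keep a copy of the trained weights (comparable to a PyTorch state_dict).
weights_state = model.get_weights()

# 2. Serialize the architecture to JSON.
model_json = model.to_json()

# 3. Clear the Keras session and drop the old model instance.
keras.backend.clear_session()
del model

# 4. Rebuild the model inside an explicit GPU device context and restore the weights.
with tf.device('/device:GPU:0'):
    model = keras.models.model_from_json(model_json)
    model.build((None, 28, 28))   # create the weight variables for 28x28 inputs
    model.set_weights(weights_state)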

1 Answer


If you really want to track GPU usage, you have to track the CUDA activity. Open Task Manager, go to the Performance tab, and select your GPU; in the GPU pane, use the drop-down menu on any of the first four graphs to switch it to "Cuda", and you will see whether the CUDA cores are actually in use.
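
If you prefer a programmatic check over Task Manager, the NVML Python bindings can report the same utilization figures. This is a minimal sketch and assumes the pynvml package (nvidia-ml-py), which is not mentioned anywhere in this thread, is installed separately:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # sampled utilization
print(f"GPU util: {util.gpu}%  memory util: {util.memory}%")
pynvml.nvmlShutdown()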
