
I am trying to train my TensorFlow model on my RTX 3070 GPU. I am using an Anaconda virtual environment, and the prompt shows that the GPU is successfully detected with no errors or warnings, but whenever the model starts training it uses the CPU instead.

My Anaconda Prompt:

2020-11-28 19:38:17.373117: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-28 19:38:17.378626: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-28 19:38:17.378679: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-28 19:38:17.381802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-11-28 19:38:17.382739: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-11-28 19:38:17.389401: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-11-28 19:38:17.391830: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-11-28 19:38:17.392332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-28 19:38:17.392422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1866] Adding visible gpu devices: 0
2020-11-28 19:38:26.072912: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-28 19:38:26.073904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1724] Found device 0 with properties:
pciBusID: 0000:08:00.0 name: GeForce RTX 3070 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-11-28 19:38:26.073984: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-28 19:38:26.074267: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-28 19:38:26.074535: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-28 19:38:26.074775: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-11-28 19:38:26.075026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-11-28 19:38:26.075275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-11-28 19:38:26.075646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-11-28 19:38:26.075871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-28 19:38:26.076139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1866] Adding visible gpu devices: 0
2020-11-28 19:38:26.738596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1265] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-28 19:38:26.738680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1271]      0
2020-11-28 19:38:26.739375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1284] 0:   N
2020-11-28 19:38:26.740149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1410] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6589 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:08:00.0, compute capability: 8.6)
2020-11-28 19:38:26.741055: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-11-28 19:38:28.028828: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:126] None of the MLIR optimization passes are enabled (registered 2)
2020-11-28 19:38:32.428408: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-28 19:38:33.305827: I tensorflow/stream_executor/cuda/cuda_dnn.cc:344] Loaded cuDNN version 8004
2020-11-28 19:38:33.753275: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-28 19:38:34.603341: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-28 19:38:34.610934: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

My Model Code:

from tensorflow import keras
from tensorflow.keras import layers

# max_features, x_train, y_train, x_val, and y_val are defined earlier in the script
inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(max_features, 128)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

I am using:

  • tf-nightly-gpu 2.5.0.dev20201111 (installed in an Anaconda virtual env)
  • CUDA 11.1 (cuda_11.1.1_456.81)
  • cuDNN v8.0.4.30 (for CUDA 11.1)
  • Python 3.8

I know that my GPU is not being used because its utilization is at 1% while my CPU is at 60% with its top process being python.

Can anyone help me get my model training using the GPU?
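For anyone debugging a similar setup, one way to confirm where ops actually execute is TensorFlow's device-placement logging. This is a minimal check (not from the original post) that assumes a standard TF 2.x install:

```python
import tensorflow as tf

# Log the device each op is placed on (call this before running any ops).
tf.debugging.set_log_device_placement(True)

# List the GPUs TensorFlow can see; an empty list means the CPU build is active.
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

# A tiny matmul: the placement log shows whether it ran on /device:GPU:0.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
print(tf.matmul(a, b))
```

If the GPU list is non-empty and the matmul is logged on GPU:0, TensorFlow itself is using the GPU, and low utilization may just mean the model is too small to saturate it.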

Josh
  • Try rebooting. Once I had a similar issue, and tensorflow was not initializing properly after I uninstalled normal tf and installed tf-gpu https://stackoverflow.com/questions/44829085/tensorflow-not-running-on-gpu – Shivam Miglani Nov 29 '20 at 01:22
  • @ShivamMiglani I've already tried rebooting. Hasn't fixed anything but thanks for the advice. – Josh Nov 29 '20 at 01:50
  • Have you checked this out? https://stackoverflow.com/a/52905362/7363404 – Axiumin_ Nov 29 '20 at 02:05
  • Just for future reference, it does not make sense to compare 1% GPU utilization with 60% CPU utilization; these are not related. Your model is being trained on the GPU, and that is what 1% GPU utilization actually means. – Dr. Snoopy Mar 22 '23 at 18:46

1 Answer


Most probably you're using the CPU build of TensorFlow instead of the GPU build. Do a "pip uninstall tensorflow" and then "pip install tensorflow-gpu" to install the package that can use the GPU.

  • My bad, I didn't specify that I am using tf-nightly-gpu. – Josh Nov 29 '20 at 01:59
  • Not a problem. I have these 2 suggestions for you: 1) Check if you have CUDA loaded into your environment. 2) Add the following line after you import TF, and print the variable "gpus" to check if the device/s can be found by the code. "gpus = tf.config.experimental.list_physical_devices('GPU')" – Tarak Nath Nandi Nov 29 '20 at 02:06
  • Or add this to your code to print the list of available devices: from tensorflow.python.client import device_lib; print(device_lib.list_local_devices()) – Tarak Nath Nandi Nov 29 '20 at 02:10
  • I'm sure that CUDA is installed correctly because I don't get any CUDA related errors in my anaconda prompt. The gpu can be found by code when I print the gpu variable and when I try tf.config.list_physical_devices('GPU'). – Josh Nov 29 '20 at 02:15
  • Ok. Did you install TF by yourself? From what you pasted in your question, it seems TF can see your devices, but one warning caught my attention "This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags" . – Tarak Nath Nandi Nov 29 '20 at 02:32
  • I used pip install tf-nightly-gpu==2.5.0.dev20201111 to install tensorflow. Is this warning what could be causing the model to train on cpu? – Josh Nov 29 '20 at 02:35
  • Also, what is the version of CUDA you're using? It is possible that the TF version isn't compatible with the CUDA version. In that case, I'd suggest that you install a somewhat older version of TF-GPU and check what happens (go for TF <2.3) – Tarak Nath Nandi Nov 29 '20 at 02:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225270/discussion-between-josh-and-tarak-nath-nandi). – Josh Nov 29 '20 at 03:07
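On the version-compatibility point raised above, a quick way to see which CUDA and cuDNN versions the installed TF wheel was built against is a sketch like the following (it assumes TF ≥ 2.4, where tf.sysconfig.get_build_info is available):

```python
import tensorflow as tf

# TF >= 2.4 exposes the build configuration of the installed wheel.
info = tf.sysconfig.get_build_info()
print("Built with CUDA:", info.get("cuda_version"))
print("Built with cuDNN:", info.get("cudnn_version"))

# False here means a CPU-only wheel is installed, regardless of drivers.
print("GPU-enabled build:", tf.test.is_built_with_cuda())
```

Comparing these values against the locally installed CUDA/cuDNN quickly rules the version-mismatch theory in or out.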