
I have been struggling with this problem for five days and have read several posts on Stack Overflow, but I still cannot get a clear idea of how to solve it. People who solved this issue mostly just recommended trying different NVIDIA driver versions until you find a lucky one that matches a given CUDA version (10.1, mostly) for a specific GPU card.

I have an NVIDIA GeForce GTX 1050 Ti on one desktop (Windows 10, 64-bit) and an NVIDIA GeForce RTX 2080 Ti on another (also Windows 10, 64-bit). I followed the hardware requirements on the official TensorFlow website to install the GPU drivers (tried versions 418.81 and 457.09 for the 1050 Ti, and 432.00 and 457.30 for the 2080 Ti), the CUDA Toolkit (10.1 on both desktops), and cuDNN (7.6.0 on both desktops), and finally modified the PATH environment variable. The TensorFlow version is 2.3.0 and the Python version is 3.7.9.

This works fine for MNIST training with this example code from the TensorFlow website, but I always get the errors below on both PCs when I run some custom code (I have a custom model inherited from keras.Model):

I'm not using TensorFlow for traditional neural-network training; I'm just taking advantage of its automatic differentiation for an optimization problem.

I don't think my custom code is the problem, because it runs well on Google Colab, and the same code also runs well on my friend's Linux system.

The code to reproduce the error (no problem running on Google Colab):

# -*- coding: utf-8 -*-
## This code runs well in the Google Colab GPU runtime
## Yuanhang Zhang & Zheyuan Zhu, 12/1/2020, CREOL, UCF, Copyright reserved
## please contact yuanhangzhang@knights.ucf.edu if you want to use the code for research or publications
## all length units are in mm

import tensorflow as tf
import numpy as np
print('tensorflow version:',tf.__version__)

#%% ASM method
dx=np.float32(5e-3) # pixel size (mm)
N_obj= 64 # grid size (alternative: 512)

def tf_fft2d(x):
    with tf.name_scope('tf_fft2d'): # add name_scope, check in tensorboard
      x_shift = tf.signal.ifftshift(x)
      x_fft=tf.signal.fft2d(x_shift)
      y = tf.signal.fftshift(x_fft)
      return y

def tf_ifft2d(x):
    with tf.name_scope('tf_ifft2d'):
      x_shift = tf.signal.ifftshift(x)
      x_ifft=tf.signal.ifft2d(x_shift)
      y = tf.signal.fftshift(x_ifft)
      return y

# angular spectrum method (ASM), not band-limited
# @tf.function
def prop_ASM(Ein,z,wavelength,N_obj,dx):
    freq_obj = np.arange(-N_obj//2,N_obj//2,1)*(1/(dx*N_obj))
    kx = 2*np.pi*freq_obj
    ky = kx.copy()
    KX,KY = np.meshgrid(kx,ky)
    k0 = 2*np.pi/wavelength
    KZ_square = k0**2-KX**2-KY**2
    KZ_square[KZ_square<0] = 0 # zero out evanescent components
    Q = np.exp(-1j*z*np.sqrt(KZ_square)) # transfer function of free space
    with tf.name_scope('prop_ASM'):
      FFT_obj = tf_fft2d(Ein)
      Q_tf = tf.constant(Q,dtype=tf.complex64)
      Eout = tf_ifft2d(FFT_obj*Q_tf)
      return Eout

print('N_obj:',N_obj)

import matplotlib.pyplot as plt
import shutil
shutil.rmtree('__pycache__',ignore_errors=True) # Delete an entire directory tree
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0' 

save_model_path='./models' 
save_mat_folder='./results' 
log_path='./tensorboard_log' # path to log training process
load_model_path = save_model_path

#%% inputs/outputs for the optimization
x = (np.arange(N_obj,dtype = np.float32)-N_obj/2)*dx
y = (np.arange(N_obj,dtype = np.float32)-N_obj/2)*dx
x_c, y_c = np.meshgrid(x,y)

# input: Gaussian mode
e_in = np.zeros((N_obj, N_obj),dtype = np.float32)  # initialize input field
w_in = np.float32(5e-2)   # beam width

e = np.exp(-((x_c)**2+(y_c)**2)/w_in**2) # Gaussian beam spots array
I = np.sum(np.abs(e)**2)
e_in = e/np.sqrt(I) # normalize power

fig, ax = plt.subplots()
im=ax.imshow(e_in)
cbar=plt.colorbar(im)  
print('e_in shape:',e_in.shape)

# output: Hermite mode
e_out = np.zeros((N_obj, N_obj),dtype = np.float32)
w_out = np.float32(5e-2) # beam width (alternative: 30e-2)
c = np.array([[0,0],[0,1]])
e = np.polynomial.hermite.hermgrid2d(np.sqrt(2)*x/w_out, np.sqrt(2)*y/w_out, c)*np.exp(-(x_c**2+y_c**2)/w_out**2)
e = np.float32(e)
I = np.sum(np.abs(e)**2)
e_out = e/np.sqrt(I) # power normalized

fig, ax = plt.subplots()
im=ax.imshow(e_out)
cbar=plt.colorbar(im)

print('e_out shape:',e_out.shape)

#%% optimization by GradientTape
z = 20 # propagation distance (mm)
lambda_design_list = np.array([1.550e-3],dtype = np.float32)

Ein = tf.constant(e_in, name = 'Ein', dtype = tf.complex64) # a 2D tensor
Eout = tf.constant(e_out, name = 'Eout', dtype = tf.complex64)

phi1 = tf.Variable(np.float32(np.ones((N_obj,N_obj))),name='phi1') # dtype: float32
phi2 = tf.Variable(np.float32(np.ones((N_obj,N_obj))),name='phi2')


def forward_propagate(Ein,z,lambda_design_list,N_obj,dx):
    E1_1 = prop_ASM(Ein,z,lambda_design_list[0],N_obj,dx) # used tf.signal.fft2d
    E1_mod_1 = E1_1*tf.exp(tf.complex(real=tf.zeros_like(phi1,dtype='float32'),imag=phi1))
    # E1_mod_1 = tf.math.multiply(E1_1,tf.exp(1j*phi1)) # element-wise multiply; fails because TF will not implicitly cast the float32 phi1 to complex
    E2_1 = prop_ASM(E1_mod_1,z,lambda_design_list[0],N_obj,dx)
    E2_mod_1 = E2_1*tf.exp(tf.complex(real=tf.zeros_like(phi2,dtype='float32'),imag=phi2)) 
    E_out = prop_ASM(E2_mod_1,z,lambda_design_list[0],N_obj,dx)
    # E_out = tf.math.multiply(E2_1,tf.exp(1j*phi2))
    return E_out

def loss_single(E_out, Eout):
    # loss = -|<E_out, Eout>|, the negative modulus of the overlap integral
    coupling_eff = tf.sqrt(
        (tf.square(tf.reduce_sum(tf.math.real(E_out)*tf.math.real(Eout)+tf.math.imag(E_out)*tf.math.imag(Eout))) +
         tf.square(tf.reduce_sum(tf.math.imag(E_out)*tf.math.real(Eout)-tf.math.real(E_out)*tf.math.imag(Eout))) ))
    # equivalent and simpler (note the conjugate of the target field):
    # coupling_eff = tf.abs(tf.reduce_sum(E_out*tf.math.conj(Eout)))
    loss = - coupling_eff
    return loss

variables = [phi1, phi2] # write variables in a list to optimize

# define optimizer
optimizer =  tf.keras.optimizers.Adam(learning_rate= 1e-2)
epoch_num = 20

for ii in tf.range(epoch_num):
  with tf.GradientTape() as tape:
    # the forward_propagate() call must be inside the tape context, otherwise grads is None:
    # the tape needs to record the complete forward propagation
    E_out = forward_propagate(Ein,z,lambda_design_list,N_obj,dx) 
    loss = loss_single(E_out, Eout)  
    tf.print('ii =:',ii,'coupling_eff =:',-loss)
    # print('watched variables in tape:',[var.name for var in tape.watched_variables()])

  # print("\n ===== calculate gradients now ====ERROR in NEXT LINE!!======\n\n")
  grads = tape.gradient(loss, variables) ## auto-differentiation
  # print(grads)

  # TensorFlow will update parameters automatically
  optimizer.apply_gradients(grads_and_vars=zip(grads, variables))

The kernel dies at grads = tape.gradient(loss, variables).

The errors on both PCs:

2020-11-29 20:41:57.457271: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-11-29 20:41:57.457480: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
[I 20:42:05.512 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
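
To narrow this down, here is a CPU-only isolation test (my own sketch, reusing the names defined in the code above; not a confirmed fix): if this step completes, the crash is in a GPU kernel rather than in the math.

# pin one optimization step to the CPU; tf.signal.fft2d and the other ops
# used here all have CPU kernels
with tf.device('/CPU:0'):
    with tf.GradientTape() as tape:
        E_out = forward_propagate(Ein, z, lambda_design_list, N_obj, dx)
        loss = loss_single(E_out, Eout)
    grads = tape.gradient(loss, variables)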

Could anyone tell me how to solve this issue? Is blindly trying different driver versions the only way to make it work?

The weird thing is that there is no such error if I run a neural-network training with the Keras API (this example) on the same PC. And if I write some very simple GradientTape code to calculate gradients (this linear regression example), there is no error either. In that sense, the driver seems to be installed correctly... really confusing.
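
For reference, a quick sanity check (standard TF 2.x APIs, to the best of my knowledge) that shows whether the GPU is visible and which CUDA/cuDNN versions the installed TensorFlow wheel was built against:

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # should list the GPU
info = tf.sysconfig.get_build_info()           # available since TF 2.3
print(info.get('cuda_version'), info.get('cudnn_version'))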

  • Check what version of TensorFlow you are running and the drivers it needs; you will have to find versions compatible with both your TensorFlow version and your hardware. – SajanGohil Nov 30 '20 at 06:00
  • @SajanGohil I followed the [TensorFlow website](https://www.tensorflow.org/install/gpu), which says CUDA 10.1 requires a driver version of 418.x or higher. But the problem is that there are so many driver versions out there, and from other people's experience only a specific version will be compatible with a specific GPU. I am not sure which one to choose for my GPU. – yuanhang Dec 01 '20 at 00:55
  • Yes, a specific GPU will have specific drivers that support it. You have to find a sweet spot between TensorFlow, the NVIDIA graphics driver, the CUDA version, the cuDNN version, etc. Also, how much data are you using? – SajanGohil Dec 01 '20 at 05:40
  • Roughly 10GB @SajanGohil – yuanhang Dec 01 '20 at 13:36
  • Is all the data passed as e_in? If so, can you try a smaller subset, like a few MB? – SajanGohil Dec 01 '20 at 13:41
  • @SajanGohil Sorry, I need to clarify: I am only using several MB of data currently, but I want to scale up to roughly 10 GB in the future. – yuanhang Dec 01 '20 at 15:08
  • Can you please add a minimal reproducible example, so that the problem can be reproduced? – SajanGohil Dec 01 '20 at 15:55
  • @SajanGohil Hello, I edited the post and gave the complete code. Please run it on your local machine and see if you get the same error. Thanks! – yuanhang Dec 01 '20 at 22:14
  • Sorry, I can't run it locally (I don't have an NVIDIA GPU), but you are right: since it runs there, the code is most likely correct. The issue might be with the Windows installation of TF, the drivers, etc. If you just want autodiff, I think you can go with JAX (https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html) as another option if you can't resolve this. – SajanGohil Dec 02 '20 at 13:43
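
A minimal sketch of the JAX route mentioned in the last comment (illustrative only; assumes jax and jaxlib are installed; jax.grad handles complex intermediates as long as the loss itself is real):

import jax
import jax.numpy as jnp

def loss_fn(phi):
    field = jnp.exp(1j*phi)          # complex field from a real phase mask
    return -jnp.abs(jnp.sum(field))  # real-valued scalar loss

phi0 = jnp.ones((64, 64))
grad_fn = jax.grad(loss_fn)          # d(loss)/d(phi), same shape as phi
print(grad_fn(phi0).shape)           # (64, 64)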

1 Answer

  1. Follow the official pip installation instructions, and make sure to update your NVIDIA GPU driver (step 5, GPU setup): Install tf with pip
  2. Make sure that you installed the right versions of CUDA and cuDNN for your specific TensorFlow version: GPU versions
  3. Enable GPU memory growth so TensorFlow does not reserve all GPU memory at startup, as shown in the sketch after this list: limiting_gpu_memory_growth
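
A minimal sketch of step 3 (standard TensorFlow API, added for illustration; it must run before the GPU is first used):

import tensorflow as tf

# allocate GPU memory on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
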
– Zakariya