How to run Python on GPU with CuPy?

Question

I'm trying to execute Python code on the GPU using the CuPy library. However, when I run nvidia-smi, no GPU processes are listed.

Here's the code:

    import numpy as np
    import cupy as cp
    from scipy.stats import rankdata

    def get_top_one_probability(vector):
      return (cp.exp(vector) / cp.sum(cp.exp(vector)))

    def get_listnet_gradient(training_dataset, real_labels, predicted_labels):
      ly_topp = get_top_one_probability(real_labels)
      cp.cuda.Stream.null.synchronize()
      s1 = -cp.matmul(cp.transpose(training_dataset), cp.reshape(ly_topp, (ly_topp.shape[0], 1)))
      cp.cuda.Stream.null.synchronize()
      exp_lz_sum = cp.sum(cp.exp(predicted_labels))
      cp.cuda.Stream.null.synchronize()
      s2 = 1 / exp_lz_sum
      s3 = cp.matmul(cp.transpose(training_dataset), cp.exp(predicted_labels))
      cp.cuda.Stream.null.synchronize()
      s2_s3 = s2 * s3 # s2 is a scalar value
      s1 = s1.reshape(s1.shape[0], 1)  # reshape returns a new array, so assign it
      cp.cuda.Stream.null.synchronize()
      s1s2s3 = cp.add(s1, s2_s3)
      cp.cuda.Stream.null.synchronize()
      return s1s2s3

    def relu(matrix):
      return cp.maximum(0, matrix)

    def get_groups_id_count(groups_id):
      current_group = 1
      group_counter = 0
      groups_id_counter = []
      for element in groups_id:
        if element != current_group:
          groups_id_counter.append((current_group, group_counter))
          current_group += 1
          group_counter = 1
        else:
          group_counter += 1
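      if group_counter:  # flush the last group, which the loop never appends
        groups_id_counter.append((current_group, group_counter))
      return groups_id_counter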

    def mul_matrix(matrix1, matrix2):
      return cp.matmul(matrix1, matrix2)

    if mode == 'train': # Train MLP
      number_of_features = np.shape(training_set_data)[1]

      # Input neurons are equal to the number of training dataset features
      input_neurons = number_of_features
      # Assuming that number of hidden neurons are equal to the number of training dataset (input neurons) features + 10
      hidden_neurons = number_of_features + 10

      # Weights random initialization
      input_hidden_weights = cp.array(np.random.rand(number_of_features, hidden_neurons) * init_var)
      # Assuming that number of output neurons is 1
      hidden_output_weights = cp.array(np.float32(np.random.rand(hidden_neurons, 1) * init_var))

      listwise_gradients = np.array([])

      for epoch in range(0, 70):
        print('Epoch {0} started...'.format(epoch))
        start_range = 0
        for group in groups_id_count:
          end_range = (start_range + group[1]) # Batch is a group of words with same group id
          batch_dataset = cp.array(training_set_data[start_range:end_range, :])
          cp.cuda.Stream.null.synchronize()
          batch_labels = cp.array(dataset_labels[start_range:end_range])
          cp.cuda.Stream.null.synchronize()
          input_hidden_mul = mul_matrix(batch_dataset, input_hidden_weights)
          cp.cuda.Stream.null.synchronize()
          hidden_neurons_output = relu(input_hidden_mul)
          cp.cuda.Stream.null.synchronize()
          mlp_output = relu(mul_matrix(hidden_neurons_output, hidden_output_weights))
          cp.cuda.Stream.null.synchronize()
          batch_gradient = get_listnet_gradient(batch_dataset, batch_labels, mlp_output)
          batch_gradient = cp.mean(cp.transpose(batch_gradient), axis=1)
          aggregated_listwise_gradient = cp.sum(batch_gradient, axis=0)
          cp.cuda.Stream.null.synchronize()
          hidden_output_weights = hidden_output_weights - (learning_rate * aggregated_listwise_gradient)
          cp.cuda.Stream.null.synchronize()
          input_hidden_weights = input_hidden_weights - (learning_rate * aggregated_listwise_gradient)
          cp.cuda.Stream.null.synchronize()
          start_range = end_range

          listwise_gradients = np.append(listwise_gradients, cp.asnumpy(aggregated_listwise_gradient))

      print('Gradients: ', listwise_gradients)

I'm using cp.cuda.Stream.null.synchronize() because I read that this statement ensures that the code finishes executing on the GPU before going to the next line.
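For context, CuPy queues kernels on the current stream in order, so a synchronize after every statement should not be needed for correctness; it mostly matters when timing code, and copying a result back with `cp.asnumpy` already waits for the GPU. The following is a minimal sketch of that idea, not the code from this question. The NumPy fallback, the helper name `timed_softmax`, and the max-subtraction in the softmax are assumptions added for illustration so the snippet also runs on a machine without a CUDA device.

```python
import time

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises when no CUDA driver/device is usable
    xp, on_gpu = cp, True
except Exception:  # assumption: fall back to NumPy so the sketch still runs
    xp, on_gpu = np, False


def get_top_one_probability(vector):
    # Compute exp() once; subtracting the max keeps exp() from overflowing
    e = xp.exp(vector - xp.max(vector))
    return e / xp.sum(e)


def timed_softmax(n=1_000_000):
    # Hypothetical helper: times one softmax over n values
    vector = xp.arange(n, dtype=xp.float32) / n
    start = time.perf_counter()
    probs = get_top_one_probability(vector)
    if on_gpu:
        # Wait for the queued kernels only here, so the timing is meaningful
        cp.cuda.Stream.null.synchronize()
    elapsed = time.perf_counter() - start
    return probs, elapsed


probs, elapsed = timed_softmax()
print("backend:", xp.__name__, "seconds:", elapsed)
print("sum of probabilities:", float(probs.sum()))
```

If the printed backend is `numpy`, CuPy could not reach a CUDA device in this environment, which would also explain an empty process list in nvidia-smi.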

Could anyone help me run the code on the GPU? Thanks in advance.

Victor Deleau · Accepted Answer · 2020-06-22T19:26:22.950

7

CuPy can run your code on different devices. You need to select the device ID associated with your GPU for your code to execute on it. Note that CuPy only enumerates CUDA devices, so the CPU is never one of them; with a single GPU, the only valid ID is 0. You can check your current device ID using:

    import cupy as cp

    x = cp.array([1, 2, 3])
    print(x.device)

To get the number of recognized devices that you have on your machine:

    print(cp.cuda.runtime.getDeviceCount())

To change your current device to ID 1 for example:

    cp.cuda.Device(1).use()

Device IDs are zero-indexed, so if you have 3 devices, the valid IDs are {0, 1, 2}.
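The checks above can be combined into one small helper. This is only a sketch: the function name `list_cuda_devices` is hypothetical, and returning an empty list when CuPy or a CUDA driver is unavailable is an assumption made so the snippet runs anywhere.

```python
def list_cuda_devices():
    """Return (id, name) pairs for each CUDA device CuPy can see."""
    try:
        import cupy as cp
        count = cp.cuda.runtime.getDeviceCount()
    except Exception:  # CuPy missing, or no usable CUDA driver/device
        return []
    devices = []
    for device_id in range(count):
        props = cp.cuda.runtime.getDeviceProperties(device_id)
        name = props["name"]
        if isinstance(name, bytes):  # the name may come back as bytes
            name = name.decode()
        devices.append((device_id, name))
    return devices


devices = list_cuda_devices()
for device_id, name in devices:
    print(device_id, name)
if not devices:
    print("No CUDA device visible to CuPy")
```

To run only part of a script on a given device, `cp.cuda.Device(device_id)` can also be used as a context manager (`with cp.cuda.Device(0): ...`) instead of calling `.use()`.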

edited Jun 22 '20 at 19:26

answered Feb 02 '20 at 16:46


Thanks for your answer. ```cp.cuda.runtime.getDeviceCount()``` returns 1, so I have to use the device at index 0, right? – pairon Feb 02 '20 at 17:15
If you only have one GPU, then yes, I guess index 0 would be the other device, which is hopefully your GPU – Victor Deleau Feb 02 '20 at 17:46
OK, thanks, I'm trying device 0. ```nvidia-smi``` still says that there are no processes on the GPU. – pairon Feb 02 '20 at 17:51
I don't know why. Vectorized computation should be much faster on your GPU. If you try running your code on each device, the fastest one should be your GPU – Victor Deleau Feb 02 '20 at 17:56
I've switched from CuPy to MinPy, which is another implementation of NumPy on GPU. However, I see that the GPU is used at 8%. Is it possible that NumPy limits the use of the GPU? – pairon Feb 03 '20 at 22:23
MinPy/NumPy is not limiting the use of the GPU. Your GPU has hundreds of cores and is made to compute massively in parallel, and deep learning isn't able to benefit entirely from that. Training large DL models on a GPU generally doesn't use more than 10% of its available resources. You can optimize and gain a few percent here and there, but that's pretty much it – Victor Deleau Feb 05 '20 at 03:06
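One factor that may contribute to low utilization in a loop like the one in the question is copying each small batch from host to device separately. The sketch below uploads the data once and slices it on the device instead. The array shapes, the random data, and the NumPy fallback are assumptions for illustration; only the names `training_set_data` and `groups_id_count` come from the question.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises when no CUDA driver/device is usable
    xp = cp
except Exception:  # assumption: fall back to NumPy so the sketch still runs
    xp = np

rng = np.random.default_rng(0)
training_set_data = rng.random((12, 4), dtype=np.float32)  # hypothetical data
groups_id_count = [(1, 5), (2, 4), (3, 3)]  # (group id, rows in that group)

# One host-to-device copy for the whole dataset instead of one per batch
device_data = xp.asarray(training_set_data)
weights = xp.asarray(rng.random((4, 14), dtype=np.float32))

batch_sums = []
start_range = 0
for _, group_size in groups_id_count:
    end_range = start_range + group_size
    batch = device_data[start_range:end_range]  # slicing stays on the device
    hidden = xp.maximum(0, xp.matmul(batch, weights))
    batch_sums.append(xp.sum(hidden))  # still a device array, no transfer yet
    start_range = end_range

# Single device-to-host transfer at the end
result = xp.stack(batch_sums)
result = cp.asnumpy(result) if xp is not np else result
print(result)
```

Whether this changes utilization noticeably depends on batch size and hardware, so it is worth measuring rather than assuming.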
