
Nearly all of the examples of Numba, CuPy, etc. available online are simple array additions that show the speedup of going from a single CPU core/thread to a GPU, and the documentation for the individual commands mostly lacks good examples. This post is intended to provide a more comprehensive example.

The initial code is provided here. It's a simple model of the classic cellular automaton. Originally it didn't even use NumPy, just plain Python and the Pyglet module for visualization.

My goal is to extend this code to a specific problem (which will be very large), but first I thought it best to optimize it for GPU usage.

Here is game_of_life.py:

```
import random as rnd
import pyglet
#import numpy as np
#from numba import vectorize, cuda, jit

class GameOfLife: 
 
    def __init__(self, window_width, window_height, cell_size, percent_fill):
        self.grid_width = int(window_width / cell_size) # cell_size 
        self.grid_height = int(window_height / cell_size) # 
        self.cell_size = cell_size
        self.percent_fill = percent_fill
        self.cells = []
        self.generate_cells()
  
    def generate_cells(self):
        for row in range(0, self.grid_height): 
            self.cells.append([])
            for col in range(0, self.grid_width):
                if rnd.random() < self.percent_fill:
                    self.cells[row].append(1)
                else:
                    self.cells[row].append(0)
                
    def run_rules(self): 
        temp = []
        for row in range(0, self.grid_height):
            temp.append([])
            for col in range(0, self.grid_width):
                cell_sum = sum([self.get_cell_value(row - 1, col),
                                self.get_cell_value(row - 1, col - 1),
                                self.get_cell_value(row,     col - 1),
                                self.get_cell_value(row + 1, col - 1),
                                self.get_cell_value(row + 1, col),
                                self.get_cell_value(row + 1, col + 1),
                                self.get_cell_value(row,     col + 1),
                                self.get_cell_value(row - 1, col + 1)])
                
                if self.cells[row][col] == 0 and cell_sum == 3:
                    temp[row].append(1)
                elif self.cells[row][col] == 1 and (cell_sum == 3 or cell_sum == 2):
                    temp[row].append(1)
                else:                 
                    temp[row].append(0)
        
        self.cells = temp

    def get_cell_value(self, row, col): 
        if row >= 0 and row < self.grid_height and col >= 0 and col < self.grid_width:
           return self.cells[row][col]
        return 0

    def draw(self): 
        for row in range(0, self.grid_height):
            for col in range(0, self.grid_width):
                if self.cells[row][col] == 1:
                    #(0, 0) (0, 20) (20, 0) (20, 20)
                    square_coords = (row * self.cell_size,                  col * self.cell_size,
                                     row * self.cell_size,                  col * self.cell_size + self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size + self.cell_size)
                    pyglet.graphics.draw_indexed(4, pyglet.gl.GL_TRIANGLES,
                                         [0, 1, 2, 1, 2, 3],
                                         ('v2i', square_coords))
```

Firstly, I could use NumPy by adding self.cells = np.asarray(self.cells) at the end of generate_cells and self.cells = np.asarray(temp) at the end of run_rules, since converting any earlier wouldn't bring a speedup, as presented here. (Actually, switching to NumPy didn't produce a noticeable speedup.)
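For reference, the conversion alone can't help much, because run_rules still walks the grid cell by cell in interpreted Python; NumPy only pays off once the neighbour sum itself becomes a whole-array operation. Below is a minimal sketch of what a vectorized update could look like (my own illustration of the idea, not code from the original project; it assumes cells is already a 2D NumPy array of 0/1):

```
import numpy as np

def run_rules_numpy(cells):
    # zero-pad one cell on every side, matching get_cell_value's
    # behaviour of returning 0 outside the grid
    padded = np.pad(cells, 1)
    # sum of the 8 neighbours for every cell at once
    neighbours = (padded[:-2, :-2] + padded[:-2, 1:-1] + padded[:-2, 2:]
                + padded[1:-1, :-2]                    + padded[1:-1, 2:]
                + padded[2:, :-2]  + padded[2:, 1:-1]  + padded[2:, 2:])
    # birth on exactly 3 neighbours, survival on 2 or 3
    alive = (neighbours == 3) | ((cells == 1) & (neighbours == 2))
    return alive.astype(cells.dtype)
```

With this form the Python loops disappear entirely, which is also the precondition for the CuPy and Numba variants discussed below.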

Regarding GPUs: for example, I added @jit before every function, and everything became very slow. I also tried @vectorize(['float32(float32, float32)'], target='cuda'), but this raised a question: how do you use @vectorize on methods whose only input argument is self?
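As I understand it, @vectorize maps a scalar function over array elements, so it can't see neighbours and doesn't fit methods that only take self. The usual workaround is to move the per-cell logic out of the class into a free function over plain arrays; for CUDA specifically that tends to mean a @cuda.jit kernel with one thread per cell. A hedged sketch (my own untested illustration, with hypothetical names):

```
from numba import cuda
import numpy as np

@cuda.jit
def life_kernel(cells, out):
    # one thread per cell; cells and out are 2D device arrays
    row, col = cuda.grid(2)
    h, w = cells.shape
    if row < h and col < w:
        s = 0
        for dr in range(-1, 2):
            for dc in range(-1, 2):
                if dr == 0 and dc == 0:
                    continue
                r = row + dr
                c = col + dc
                if 0 <= r < h and 0 <= c < w:
                    s += cells[r, c]
        if s == 3 or (cells[row, col] == 1 and s == 2):
            out[row, col] = 1
        else:
            out[row, col] = 0

# keep both buffers on the device and swap them each generation,
# so nothing is copied back to the host inside the loop
cells_d = cuda.to_device(np.random.randint(0, 2, (512, 512), dtype=np.int32))
out_d = cuda.device_array_like(cells_d)
threads = (16, 16)
blocks = ((512 + 15) // 16, (512 + 15) // 16)
life_kernel[blocks, threads](cells_d, out_d)
cells_d, out_d = out_d, cells_d
```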

I also tried substituting CuPy for NumPy, as in self.cells = cupy.asarray(self.cells), but this also became very slow.
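My understanding of why this gets slower (stated as an assumption, not a measurement): cupy.asarray only moves the data to the GPU, and every subsequent self.cells[row][col] in the Python loops then triggers a tiny device synchronization and transfer, which costs far more than a list lookup. The fix is the same as above: only whole-array operations, so the grid never leaves the device between generations. A sketch mirroring the NumPy version:

```
import cupy as cp

def run_rules_cupy(cells_gpu):
    # identical logic to the NumPy sketch, but every operation
    # runs on the device; no per-element host round trips
    padded = cp.pad(cells_gpu, 1)
    neighbours = (padded[:-2, :-2] + padded[:-2, 1:-1] + padded[:-2, 2:]
                + padded[1:-1, :-2]                    + padded[1:-1, 2:]
                + padded[2:, :-2]  + padded[2:, 1:-1]  + padded[2:, 2:])
    alive = (neighbours == 3) | ((cells_gpu == 1) & (neighbours == 2))
    return alive.astype(cells_gpu.dtype)

# transfer to the host only once per frame, just for drawing:
# cells_host = cp.asnumpy(cells_gpu)
```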

Following the initial idea of an extended example of GPU usage, what would be the proper approach to this problem? Where is the right place to put the modifications/vectorizations/parallelizations/Numba/CuPy, etc.? And most importantly, why?

Additional info: besides the code provided, here's the main.py file:

```
import pyglet
from game_of_life import GameOfLife 
 
class Window(pyglet.window.Window):
 
    def __init__(self):
        super().__init__(800,800)
        self.gameOfLife = GameOfLife(self.get_size()[0],
                                     self.get_size()[1],
                                     15,  # the smaller this value, the more computation-intensive it becomes
                                     0.5) 

        pyglet.clock.schedule_interval(self.update, 1.0/24.0) # 24 frames per second
 
    def on_draw(self):
        self.clear()
        self.gameOfLife.draw()
        
    def update(self, dt):
        self.gameOfLife.run_rules()
 
if __name__ == '__main__':
    window = Window()
    pyglet.app.run()
```
rod_CAE
  • I have very limited understanding of using the `cuda.jit` decorator, but it seems to me that the main cause of a kernel underperforming like this is excessive data transfer between the CPU and the GPU. To avoid this, one must pass only the necessary variables, especially when talking about large arrays. I think that by using `self` as an argument for every function (would-be kernel), you might be passing unnecessary data. Also, keep in mind that every thread operates on a single element of an array, so iterating over an array using `for` will not be parallelised. Hoping this somewhat helps. – boi Aug 27 '20 at 18:56
  • @boi, thanks for pointing this out. I started with Python 3 months ago, and it's the first language in which I've used classes; they're something new to me, even though I've been coding for 10+ years. Things like `self`, `__init__`, etc. are new to me. I'll look more closely at passing the arguments correctly. Regarding `for`, do you know if Python has something similar to Matlab's `parfor`? – rod_CAE Aug 28 '20 at 12:04
  • Actually yes, `numba.prange` might be what you're looking for (see the sketch after these comments), although I don't think it is possible to parallelize a loop in `numba.cuda`. Here is the documentation: https://numba.readthedocs.io/en/stable/user/parallel.html?highlight=prange. I'm also quite new to all of this :). – boi Aug 28 '20 at 13:43
  • @rod_CAE any update on this? I can also appreciate the newness of the `class` construct. – Sterling Aug 20 '21 at 06:13
  • @Sterling Unfortunately not. The project continued without parallelization for a while, but it's currently on hold because my other projects became more important than this one... – rod_CAE Aug 21 '21 at 14:25
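To make the `numba.prange` suggestion from the comments concrete, here is a hedged sketch of the update loop as a free function compiled with `@njit(parallel=True)` (my own illustration; the function name and wiring are hypothetical, not from the thread):

```
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def run_rules_parallel(cells):
    h, w = cells.shape
    out = np.zeros_like(cells)
    for row in prange(h):            # rows are distributed across CPU threads
        for col in range(w):
            s = 0
            for dr in range(-1, 2):
                for dc in range(-1, 2):
                    if dr == 0 and dc == 0:
                        continue
                    r = row + dr
                    c = col + dc
                    if 0 <= r < h and 0 <= c < w:
                        s += cells[r, c]
            if s == 3 or (cells[row, col] == 1 and s == 2):
                out[row, col] = 1
    return out
```

The class would then call `self.cells = run_rules_parallel(self.cells)` instead of looping itself, which also sidesteps the `self` problem: the method stays a thin wrapper, and the compiled function only ever sees a plain array.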

1 Answer


I don't fully understand your example, but I needed GPU computing myself, and after a few days of pain I think I understand its usage, so I'll show it here, hoping it helps. In addition, I need to point out that in the call `binsort_kernel(cuts, cuts, ...)` I pass `cuts` twice: the first one has its type declared as a plain elementwise parameter, so the kernel treats it as the element being traversed and it cannot be read by index; the second is declared `raw`, so I can use it for freely indexed access inside the kernel.

```
import cupy as cp

binsort_kernel = cp.ElementwiseKernel(
'int32 I,raw T cut,raw T ind,int32 row,int32 col,int32 q','raw T out,raw T bin,raw T num',    
'''
int i_x = i / col;                
int i_y = i % col;                
int b_f = i_x*col;                
int b_l = b_f+col;                
int n_x = i_x * q;                
int inx = i_x%row*col;            
////////////////////////////////////////////////////////////////////////////////////////
int r_x = 0; int adi = 0; int adb = 0;  
////////////////////////////////////////////////////////////////////////////////////////
if (i_y == 0)
{
for(size_t j=b_f; j<b_l; j++){
    if (cut[j]<q){                
        r_x = inx + j -b_f;       
        adb = n_x + cut[j];       
        adi = bin[adb] + num[adb];
        out[adi] = ind[r_x];      
        num[adb]+= 1;             
    }}
}
////////////////////////////////////////////////////////////////////////////////////////
''','binsort')

# cuts, ind, row, col, q, iout, bins, bnum are defined in the surrounding code
binsort_kernel(cuts, cuts, ind, row, col, q, iout, bins, bnum)
```
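To make the raw-versus-elementwise distinction above concrete, here is a small self-contained sketch (my own illustration, not part of the original answer): the first, non-`raw` argument drives the elementwise loop and supplies the index `i`, while `raw` arguments can be indexed arbitrarily, which is why the same array is passed twice.

```
import cupy as cp

# reverse each row of a matrix: 'x' is consumed elementwise,
# 'src' is the same data passed again as raw so it can be indexed freely
reverse_rows = cp.ElementwiseKernel(
    'T x, raw T src, int32 col',
    'T out',
    '''
    int r = i / col;                  // row of this thread's element
    int c = i % col;                  // column of this thread's element
    out = src[r * col + (col - 1 - c)];
    ''',
    'reverse_rows')

a = cp.arange(12, dtype=cp.float32).reshape(3, 4)
b = reverse_rows(a, a, 4)             # pass the array twice, as in the answer
```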

weidong
  • What does this answer address? binsort doesn't seem to be used in OP's post, but it looks like you're using CuPy and showing how you can use ElementwiseKernel with C code? – Sterling Aug 20 '21 at 06:00