
Nearly all of the examples of Numba, CuPy, etc. available online are simple array additions that show the speedup of going from a single CPU core/thread to a GPU, and the documentation for the individual commands mostly lacks good examples. This post is intended to provide a more comprehensive example.

The initial code is provided here. It's a simple model of the classic cellular automaton. Originally it didn't even use NumPy, just plain Python and the Pyglet module for visualization.

My goal is to extend this code to a specific problem (which will be very large), but first I thought it best to optimize it for GPU usage.

Here is game_of_life.py:

```
import random as rnd
import pyglet
#import numpy as np
#from numba import vectorize, cuda, jit

class GameOfLife: 
 
    def __init__(self, window_width, window_height, cell_size, percent_fill):
        self.grid_width = int(window_width / cell_size) # cell_size 
        self.grid_height = int(window_height / cell_size) # 
        self.cell_size = cell_size
        self.percent_fill = percent_fill
        self.cells = []
        self.generate_cells()
  
    def generate_cells(self):
        for row in range(0, self.grid_height): 
            self.cells.append([])
            for col in range(0, self.grid_width):
                if rnd.random() < self.percent_fill:
                    self.cells[row].append(1)
                else:
                    self.cells[row].append(0)
                
    def run_rules(self): 
        temp = []
        for row in range(0, self.grid_height):
            temp.append([])
            for col in range(0, self.grid_width):
                cell_sum = sum([self.get_cell_value(row - 1, col),
                                self.get_cell_value(row - 1, col - 1),
                                self.get_cell_value(row,     col - 1),
                                self.get_cell_value(row + 1, col - 1),
                                self.get_cell_value(row + 1, col),
                                self.get_cell_value(row + 1, col + 1),
                                self.get_cell_value(row,     col + 1),
                                self.get_cell_value(row - 1, col + 1)])
                
                if self.cells[row][col] == 0 and cell_sum == 3:
                    temp[row].append(1)
                elif self.cells[row][col] == 1 and (cell_sum == 3 or cell_sum == 2):
                    temp[row].append(1)
                else:                 
                    temp[row].append(0)
        
        self.cells = temp

    def get_cell_value(self, row, col): 
        if row >= 0 and row < self.grid_height and col >= 0 and col < self.grid_width:
           return self.cells[row][col]
        return 0

    def draw(self): 
        for row in range(0, self.grid_height):
            for col in range(0, self.grid_width):
                if self.cells[row][col] == 1:
                    #(0, 0) (0, 20) (20, 0) (20, 20)
                    square_coords = (row * self.cell_size,                  col * self.cell_size,
                                     row * self.cell_size,                  col * self.cell_size + self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size + self.cell_size)
                    pyglet.graphics.draw_indexed(4, pyglet.gl.GL_TRIANGLES,
                                         [0, 1, 2, 1, 2, 3],
                                         ('v2i', square_coords))
```

Firstly, I could use NumPy by adding self.cells = np.asarray(self.cells) at the end of generate_cells and self.cells = np.asarray(temp) at the end of run_rules, since converting any earlier wouldn't bring a speedup, as presented here. (Actually, switching to NumPy didn't produce a noticeable speedup.)
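For reference, the conversion alone can't help much, because run_rules still walks the grid cell by cell in interpreted Python; NumPy only pays off once the neighbour sum itself becomes a whole-array operation. Below is a minimal sketch of what a vectorized update could look like (my own illustration of the idea, not code from the original project; it assumes cells is already a 2D NumPy array of 0/1):

```
import numpy as np

def run_rules_numpy(cells):
    # zero-pad one cell on every side, matching get_cell_value's
    # behaviour of returning 0 outside the grid
    padded = np.pad(cells, 1)
    # sum of the 8 neighbours for every cell at once
    neighbours = (padded[:-2, :-2] + padded[:-2, 1:-1] + padded[:-2, 2:]
                + padded[1:-1, :-2]                    + padded[1:-1, 2:]
                + padded[2:, :-2]  + padded[2:, 1:-1]  + padded[2:, 2:])
    # birth on exactly 3 neighbours, survival on 2 or 3
    alive = (neighbours == 3) | ((cells == 1) & (neighbours == 2))
    return alive.astype(cells.dtype)
```

With this form the Python loops disappear entirely, which is also the precondition for the CuPy and Numba variants discussed below.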

Regarding GPUs: for example, I added @jit before every function, and everything became very slow. I also tried @vectorize(['float32(float32, float32)'], target='cuda'), but this raised a question: how do you use @vectorize on methods whose only input argument is self?
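As I understand it, @vectorize maps a scalar function over array elements, so it can't see neighbours and doesn't fit methods that only take self. The usual workaround is to move the per-cell logic out of the class into a free function over plain arrays; for CUDA specifically that tends to mean a @cuda.jit kernel with one thread per cell. A hedged sketch (my own untested illustration, with hypothetical names):

```
from numba import cuda
import numpy as np

@cuda.jit
def life_kernel(cells, out):
    # one thread per cell; cells and out are 2D device arrays
    row, col = cuda.grid(2)
    h, w = cells.shape
    if row < h and col < w:
        s = 0
        for dr in range(-1, 2):
            for dc in range(-1, 2):
                if dr == 0 and dc == 0:
                    continue
                r = row + dr
                c = col + dc
                if 0 <= r < h and 0 <= c < w:
                    s += cells[r, c]
        if s == 3 or (cells[row, col] == 1 and s == 2):
            out[row, col] = 1
        else:
            out[row, col] = 0

# keep both buffers on the device and swap them each generation,
# so nothing is copied back to the host inside the loop
cells_d = cuda.to_device(np.random.randint(0, 2, (512, 512), dtype=np.int32))
out_d = cuda.device_array_like(cells_d)
threads = (16, 16)
blocks = ((512 + 15) // 16, (512 + 15) // 16)
life_kernel[blocks, threads](cells_d, out_d)
cells_d, out_d = out_d, cells_d
```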

I also tried substituting CuPy for NumPy, as in self.cells = cupy.asarray(self.cells), but this also became very slow.
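My understanding of why this gets slower (stated as an assumption, not a measurement): cupy.asarray only moves the data to the GPU, and every subsequent self.cells[row][col] in the Python loops then triggers a tiny device synchronization and transfer, which costs far more than a list lookup. The fix is the same as above: only whole-array operations, so the grid never leaves the device between generations. A sketch mirroring the NumPy version:

```
import cupy as cp

def run_rules_cupy(cells_gpu):
    # identical logic to the NumPy sketch, but every operation
    # runs on the device; no per-element host round trips
    padded = cp.pad(cells_gpu, 1)
    neighbours = (padded[:-2, :-2] + padded[:-2, 1:-1] + padded[:-2, 2:]
                + padded[1:-1, :-2]                    + padded[1:-1, 2:]
                + padded[2:, :-2]  + padded[2:, 1:-1]  + padded[2:, 2:])
    alive = (neighbours == 3) | ((cells_gpu == 1) & (neighbours == 2))
    return alive.astype(cells_gpu.dtype)

# transfer to the host only once per frame, just for drawing:
# cells_host = cp.asnumpy(cells_gpu)
```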

Following the initial idea of an extended example of GPU usage, what would be the proper approach to this problem? Where is the right place to put the modifications/vectorizations/parallelizations/Numba/CuPy, etc.? And most importantly, why?

Additional info: besides the code provided, here's the main.py file:

```
import pyglet
from game_of_life import GameOfLife 
 
class Window(pyglet.window.Window):
 
    def __init__(self):
        super().__init__(800,800)
        self.gameOfLife = GameOfLife(self.get_size()[0],
                                     self.get_size()[1],
                                     15,  # the smaller this value, the more computation-intensive it becomes
                                     0.5) 

        pyglet.clock.schedule_interval(self.update, 1.0/24.0) # 24 frames per second
 
    def on_draw(self):
        self.clear()
        self.gameOfLife.draw()
        
    def update(self, dt):
        self.gameOfLife.run_rules()
 
if __name__ == '__main__':
    window = Window()
    pyglet.app.run()
```
rod_CAE
  • I have very limited understanding of using the `cuda.jit` decorator, but it seems to me that the main cause of a kernel underperforming like this is excessive data transfer between the CPU and the GPU. To avoid this, one must pass only the necessary variables, especially when talking about large arrays. I think that by using `self` as an argument for every function (would-be kernel), you might be passing unnecessary data. Also, keep in mind that every thread operates on a single element of an array, so iterating over an array using `for` will not be parallelised. Hoping this somewhat helps. – boi Aug 27 '20 at 18:56
  • @boi, thanks for pointing this out. I started with Python 3 months ago, and it's the first language in which I've used classes; they're something new to me, even though I've been coding for 10+ years. Things like `self`, `__init__`, etc. are new to me. I'll look more closely at passing the arguments correctly. Regarding `for`, do you know if Python has something similar to Matlab's `parfor`? – rod_CAE Aug 28 '20 at 12:04
  • Actually yes, `numba.prange` might be what you're looking for (see the sketch after these comments), although I don't think it is possible to parallelize a loop in `numba.cuda`. Here is the documentation: https://numba.readthedocs.io/en/stable/user/parallel.html?highlight=prange. I'm also quite new to all of this :). – boi Aug 28 '20 at 13:43
  • @rod_CAE any update on this? I can also appreciate the newness of the `class` construct. – Sterling Aug 20 '21 at 06:13
  • @Sterling Unfortunately not. The project continued without parallelization for a while, but it's currently on hold because my other projects became more important than this one... – rod_CAE Aug 21 '21 at 14:25
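To make the `numba.prange` suggestion from the comments concrete, here is a hedged sketch of the update loop as a free function compiled with `@njit(parallel=True)` (my own illustration; the function name and wiring are hypothetical, not from the thread):

```
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def run_rules_parallel(cells):
    h, w = cells.shape
    out = np.zeros_like(cells)
    for row in prange(h):            # rows are distributed across CPU threads
        for col in range(w):
            s = 0
            for dr in range(-1, 2):
                for dc in range(-1, 2):
                    if dr == 0 and dc == 0:
                        continue
                    r = row + dr
                    c = col + dc
                    if 0 <= r < h and 0 <= c < w:
                        s += cells[r, c]
            if s == 3 or (cells[row, col] == 1 and s == 2):
                out[row, col] = 1
    return out
```

The class would then call `self.cells = run_rules_parallel(self.cells)` instead of looping itself, which also sidesteps the `self` problem: the method stays a thin wrapper, and the compiled function only ever sees a plain array.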

1 Answer


I don't fully understand your example, but I needed GPU computing myself, and after a few days of pain I think I understand its usage, so I'll show it here, hoping it helps. In addition, I need to point out that in the call `binsort_kernel(cuts, cuts, ...)` I pass `cuts` twice: the first one has its type declared as a plain elementwise parameter, so the kernel treats it as the element being traversed and it cannot be read by index; the second is declared `raw`, so I can use it for freely indexed access inside the kernel.

```
import cupy as cp

binsort_kernel = cp.ElementwiseKernel(
'int32 I,raw T cut,raw T ind,int32 row,int32 col,int32 q','raw T out,raw T bin,raw T num',    
'''
int i_x = i / col;                
int i_y = i % col;                
int b_f = i_x*col;                
int b_l = b_f+col;                
int n_x = i_x * q;                
int inx = i_x%row*col;            
////////////////////////////////////////////////////////////////////////////////////////
int r_x = 0; int adi = 0; int adb = 0;  
////////////////////////////////////////////////////////////////////////////////////////
if (i_y == 0)
{
for(size_t j=b_f; j<b_l; j++){
    if (cut[j]<q){                
        r_x = inx + j -b_f;       
        adb = n_x + cut[j];       
        adi = bin[adb] + num[adb];
        out[adi] = ind[r_x];      
        num[adb]+= 1;             
    }}
}
////////////////////////////////////////////////////////////////////////////////////////
''','binsort')

# cuts, ind, row, col, q, iout, bins, bnum are defined in the surrounding code
binsort_kernel(cuts, cuts, ind, row, col, q, iout, bins, bnum)
```
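To make the raw-versus-elementwise distinction above concrete, here is a small self-contained sketch (my own illustration, not part of the original answer): the first, non-`raw` argument drives the elementwise loop and supplies the index `i`, while `raw` arguments can be indexed arbitrarily, which is why the same array is passed twice.

```
import cupy as cp

# reverse each row of a matrix: 'x' is consumed elementwise,
# 'src' is the same data passed again as raw so it can be indexed freely
reverse_rows = cp.ElementwiseKernel(
    'T x, raw T src, int32 col',
    'T out',
    '''
    int r = i / col;                  // row of this thread's element
    int c = i % col;                  // column of this thread's element
    out = src[r * col + (col - 1 - c)];
    ''',
    'reverse_rows')

a = cp.arange(12, dtype=cp.float32).reshape(3, 4)
b = reverse_rows(a, a, 4)             # pass the array twice, as in the answer
```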

weidong
  • What does this answer address? binsort doesn't seem to be used in OP's post, but it looks like you're using CuPy and showing how you can use ElementwiseKernel with C code? – Sterling Aug 20 '21 at 06:00