
I know I can easily rewrite a NumPy dot product as a CuPy dot product using the corresponding API:

import numpy as np
import cupy as cp

# NumPy (CPU) dot product
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
np.dot(arr1, arr2)

# CuPy (GPU): move the arrays to the device, then take the dot product there
arr1_ = cp.asarray(arr1)
arr2_ = cp.asarray(arr2)
cp.dot(arr1_, arr2_)

I read that elementwise kernels in CuPy can run a lot faster (hundreds of times faster than the corresponding NumPy operations). So I was wondering: Q1. Can I do the above dot product with an elementwise kernel, or is the dot product simply not an elementwise operation?
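For context, a CuPy ElementwiseKernel computes each output element from the corresponding input elements only; there is no sum over a shared axis, which is why a dot product does not fit the pattern. A minimal sketch (adapted from the squared-difference example in the CuPy docs):

import cupy as cp

# Each output element z[i] depends only on x[i] and y[i] -- no reduction.
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',       # input parameters
    'float32 z',                  # output parameter
    'z = (x - y) * (x - y)',      # per-element operation
    'squared_diff')

a = cp.arange(10, dtype=cp.float32)
b = cp.full(10, 3, dtype=cp.float32)
print(squared_diff(a, b))

A dot product, by contrast, needs a sum over the shared axis, so it is handled by cp.dot (which calls into cuBLAS) rather than by an elementwise kernel.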

I want to increase the execution speed of a neural network that I coded from scratch in NumPy (for academic purposes; I don't want to use PyTorch or TensorFlow), and most operations in neural network computation involve dot products. Q2. So, if we cannot use a CuPy elementwise kernel for the dot product, what else can I use elementwise kernels for (in the context of neural networks for multiclass classification)?
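For instance (a hedged sketch; the kernel and the shapes below are made up for illustration), the purely elementwise pieces of a network, such as activation functions and their derivatives, are natural candidates for an ElementwiseKernel, while the matrix products stay with cp.dot:

import cupy as cp

# Hypothetical fused activation: leaky ReLU in a single kernel launch.
leaky_relu = cp.ElementwiseKernel(
    'float32 x, float32 alpha',
    'float32 y',
    'y = x > 0 ? x : alpha * x',
    'leaky_relu')

W = cp.random.randn(256, 128).astype(cp.float32)   # layer weights
x = cp.random.randn(128, 64).astype(cp.float32)    # batch of inputs

z = cp.dot(W, x)                       # matrix product: leave this to cp.dot / cuBLAS
a = leaky_relu(z, cp.float32(0.01))    # elementwise part: custom kernel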

Q3. Is there any faster alternative in CuPy to cupy.dot? (I just want to be sure I am using the fastest approach.)
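If you want to verify this on your own hardware, a rough timing sketch like the following compares cp.dot with cp.matmul (both are expected to dispatch to cuBLAS for floating-point arrays); the explicit synchronization matters because GPU kernels launch asynchronously:

import time
import cupy as cp

a = cp.random.rand(2048, 2048).astype(cp.float32)
b = cp.random.rand(2048, 2048).astype(cp.float32)

def timed(fn, *args, repeat=10):
    fn(*args)                          # warm-up (compilation, cuBLAS setup)
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    cp.cuda.Device().synchronize()     # wait for queued GPU work to finish
    return (time.perf_counter() - start) / repeat

print('cp.dot   ', timed(cp.dot, a, b))
print('cp.matmul', timed(cp.matmul, a, b))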

MsA
  • `numpy` `dot/matmul` is fast, since it uses `BLAS` code (at least for float dtype arrays). `dot` is not an element-wise multiplication, `A*B`. Your link doesn't mention `dot`. There is a bar in the graph for `matrix multiplication`. – hpaulj Mar 04 '23 at 20:47
  • What do you mean by "`numpy` `dot/matmul` is fast"? Is `numpy.dot` faster than `cupy.dot`? Does the "matrix multiplication" bar in that graph refer to the dot product and mean `cupy.dot` is almost 20 times faster than `numpy.dot`? – MsA Mar 04 '23 at 21:02
  • 1
    A possibly relevant SO: https://stackoverflow.com/questions/68754407/why-are-cuda-gpu-matrix-multiplies-slower-than-numpy-how-is-numpy-so-fast – hpaulj Mar 04 '23 at 22:39
  • @hpaulj super thanks for pointing me to that great post! It definitely shows `cupy.dot` is way faster than `numpy.dot` on the GPU. I feel Q1 and Q3 are somewhat answered: we cannot use an elementwise kernel for the dot product, and `cupy.dot` is very fast and may not have any faster alternative. But I guess Q2 still remains. Can you please suggest ways to use elementwise kernels in the context of a neural network implementation from scratch? – MsA Mar 05 '23 at 08:59
  • In `einsum` terminology, matrix multiplication is 'ij,jk->ik'. That can be thought of as an 'ij1' * '1jk' broadcasted elementwise multiplication producing an 'ijk', followed by a sum over 'j'. But highly optimized `BLAS` code takes a more direct approach, with low-level iterations over the shared 'j'. – hpaulj Mar 05 '23 at 16:05
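A quick NumPy check of the decomposition described in the last comment, written as a broadcasted elementwise product followed by a sum over the shared axis 'j' (illustrative sketch):

import numpy as np

A = np.arange(6).reshape(2, 3)        # 'ij'
B = np.arange(12).reshape(3, 4)       # 'jk'

# 'ij1' * '1jk' -> 'ijk', then sum over the shared axis j
via_broadcast = (A[:, :, None] * B[None, :, :]).sum(axis=1)

print(np.array_equal(via_broadcast, A @ B))   # True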

0 Answers