I'm trying to implement a convolutional neural network in Python. Originally I was using scipy.signal's convolve2d to do the convolution, but it has a lot of overhead, and since I know exactly what my input looks like, it's faster to implement my own algorithm in C and call it from Python.
I've implemented two functions:
- Convolving a matrix with a non-separable kernel
- Convolving a matrix with a separable kernel (for now I assume Python does the rank checking and splitting before passing the kernels on to C)
Neither function pads the input, since I want the dimensionality reduction.
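Roughly what I mean by the rank checking and splitting on the Python side — a sketch with an SVD, not my exact code (`try_separate` is just an illustrative name):

```python
import numpy as np

def try_separate(w, tol=1e-10):
    """Return (w1, w2) with np.outer(w1, w2) == w if the 2D kernel w
    has rank 1 (i.e. is separable); return None otherwise."""
    u, s, vt = np.linalg.svd(w)
    if s[0] == 0.0 or np.sum(s > tol * s[0]) != 1:
        return None                  # rank 0 or rank > 1: not separable
    scale = np.sqrt(s[0])
    w1 = u[:, 0] * scale             # vertical (column) kernel
    w2 = vt[0, :] * scale            # horizontal (row) kernel
    return w1, w2
```

If this returns None, the non-separable path is used instead.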
Non-separable 2D Convolution
// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i;
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;

    for(i = 0; i < RESULT_DIM; i++)
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM; // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }

    return result;
}
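For reference, these loops amount to a "valid"-mode convolution (kernel flipped in both axes, no padding), which I check against a slow NumPy version — a sketch, `conv2_ref` is just an illustrative name:

```python
import numpy as np

def conv2_ref(a, w):
    """'Valid'-mode 2D convolution: flip the kernel in both axes, then
    slide it over the image; the output shrinks by FILTER_DIM - 1."""
    img_dim = a.shape[0]
    filter_dim = w.shape[0]
    result_dim = img_dim - filter_dim + 1
    wf = w[::-1, ::-1]                        # kernel flipped in both axes
    out = np.empty((result_dim, result_dim))
    for i in range(result_dim):
        for j in range(result_dim):
            out[i, j] = np.sum(wf * a[i:i + filter_dim, j:j + filter_dim])
    return out
```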
Separable 2D Convolution
// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i;
    register int j;
    register int k1, k2;
    register int t;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double)); // RESULT_DIM x IMG_DIM intermediate

    for(i = 0; i < RESULT_DIM; i++) // convolve with w1 (vertical pass)
    {
        t = i * IMG_DIM; // tmp has IMG_DIM columns
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[k2 * IMG_DIM + t + j];
            }
            tmp[t + j] = acc;
        }
    }

    for(i = 0; i < RESULT_DIM; i++) // convolve with w2 (horizontal pass)
    {
        t = i * IMG_DIM;
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t + (j + k2)];
            }
            result[i * RESULT_DIM + j] = acc;
        }
    }

    free(tmp);
    return result;
}
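The two-pass version should agree with the direct one whenever the 2D kernel equals np.outer(w1, w2), which is what I test against with another slow NumPy reference (again a sketch with an illustrative name):

```python
import numpy as np

def conv2sep_ref(a, w1, w2):
    """Two-pass 'valid' convolution: vertical pass with w1, then
    horizontal pass with w2 (both 1D kernels flipped, no padding)."""
    img_dim = a.shape[0]
    f = len(w1)
    r = img_dim - f + 1                  # RESULT_DIM
    w1f, w2f = w1[::-1], w2[::-1]        # flipped 1D kernels
    tmp = np.empty((r, img_dim))         # r x img_dim intermediate
    for i in range(r):
        for j in range(img_dim):
            tmp[i, j] = np.dot(w1f, a[i:i + f, j])
    out = np.empty((r, r))
    for i in range(r):
        for j in range(r):
            out[i, j] = np.dot(w2f, tmp[i, j:j + f])
    return out
```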
Compiling with gcc's -O3 flag and testing on a 2.7 GHz Intel i7, using a 4000x4000 matrix and a 5x5 kernel, I get the following averages over 5 runs:
- Non-separable: 271.21900 ms
- Separable: 127.32000 ms
This is still a considerable improvement over scipy.signal's convolve2d, which takes around 2 seconds for the same operation, but I need more speed since I'll be calling these functions thousands of times. Changing the data type to float isn't an option at the moment, even though it would give a considerable speedup.
Is there a way I can optimise these algorithms further? Can I apply any cache tricks or routines to speed it up?
Any suggestions would be appreciated.