
I am working on a project involving object detection through deep learning, with the underlying detection code written in C. Due to the requirements of the project, this code has a Python wrapper around it, which interfaces with the C functions through ctypes. Images are read in Python and then transferred into C to be processed as a batch.

In its current state, the code is very unoptimized: the images (640x360x3 each) are read using cv2.imread then stacked into a numpy array. For example, for a batch size of 16, the dimensions of this array are (16,360,640,3). Once this is done, a pointer to this array is passed through ctypes into C where the array is parsed, pixel values are normalized and rearranged into a 2D array. The dimensions of the 2D array are 16x691200 (16x(640*360*3)), arranged as follows.

row [0]: Image 0: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
row [1]: Image 1: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
.
.
row [15]: Image 15: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....


The C code for doing this currently looks like this, where the pixel values are accessed through strides and arranged sequentially per image. nb is the total number of images in the batch (usually 16); h, w and c are 360, 640 and 3 respectively.

matrix ndarray_to_matrix(unsigned char* src, long* shape, long* strides)
{
    int nb = shape[0];   /* batch size */
    int h  = shape[1];   /* image height */
    int w  = shape[2];   /* image width */
    int c  = shape[3];   /* channels */
    matrix X = make_matrix(nb, h*w*c);

    /* numpy strides are in bytes; since src is unsigned char (1 byte per
       element) they can be used directly as element offsets */
    int step_b = strides[0];
    int step_h = strides[1];
    int step_w = strides[2];
    int step_c = strides[3];

    int b, i, j, k;
    int index1, index2 = 0;

    for(b = 0; b < nb; ++b) {
        for(i = 0; i < h; ++i) {
            for(k = 0; k < c; ++k) {
                for(j = 0; j < w; ++j) {
                    /* destination: channel-planar, row-major within a channel */
                    index1 = k*w*h + i*w + j;
                    /* source: strided HWC layout of the numpy batch array */
                    index2 = step_b*b + step_h*i + step_w*j + step_c*k;
                    X.vals[b][index1] = src[index2]/255.;
                }
            }
        }
    }
    return X;
}

And the corresponding Python code that calls this function (batchImageData is the numpy array holding the batch):

import cv2
import numpy as np
from ctypes import POINTER, c_ubyte

# preallocated batch buffer: (batch size, height, width, channels), uint8
batchImageData = np.empty((end - start, 360, 640, 3), dtype=np.uint8)

for i in range(start, end):
    imgName = imgDir + '/' + allImageName[i]
    img = cv2.imread(imgName, 1)
    batchImageData[i-start,:,:] = img[:,:]

data = batchImageData.ctypes.data_as(POINTER(c_ubyte))
resmatrix = self.ndarray_to_matrix(data, batchImageData.ctypes.shape, batchImageData.ctypes.strides)

As of now, this ctypes implementation takes about 35 ms for a batch of 16 images. I'm working on a very FPS-critical image processing pipeline, so is there a more efficient way of doing these operations? Specifically:

  1. Can I read the image directly as a 'strided' one-dimensional array in Python from disk, thus avoiding the iterative access and copying?
  2. I have looked into numpy operations such as np.ascontiguousarray(img.transpose(2,0,1).flat, dtype=float)/255., which should achieve something similar, but this actually takes more time, possibly because it is called per image from Python (a whole-batch sketch of this idea follows the list below).
  3. Would Cython help anywhere during the read operation?
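
For context on point 2, the rearrangement described above can also be written as one vectorized operation over the whole batch rather than per image. This is only a minimal sketch, assuming batchImageData is the (16, 360, 640, 3) uint8 array described above:

import numpy as np

# hypothetical batch buffer matching the shapes described above
batchImageData = np.zeros((16, 360, 640, 3), dtype=np.uint8)

# move channels in front of height/width, flatten each image into one row and
# normalize to [0, 1]; result shape is (16, 691200), with the same
# B-plane / G-plane / R-plane order per row as the C code produces
X = batchImageData.transpose(0, 3, 1, 2).reshape(16, -1).astype(np.float64) / 255.0

Whether this beats the 35 ms ctypes version would have to be measured; the transpose itself is only a view, and the actual copy happens in the reshape/astype step.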
HighVoltage

1 Answer


Regarding the ascontiguousarray method, I'm assuming it's pretty slow because Python has to do some memory work to return a C-contiguous array.

EDIT 1: I saw this answer; apparently OpenCV's imread function should already return a contiguous array.
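
If you want to verify that on your side, a quick sketch using NumPy's flags attribute (the file name is just a placeholder):

import cv2

img = cv2.imread('some_image.jpg', 1)    # hypothetical file name
print(img.flags['C_CONTIGUOUS'])          # True if imread returned a C-contiguous buffer
print(img.strides)                        # (w*c, c, 1) for a contiguous HWC uint8 image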

I am not very familiar with ctypes, but I happen to use the pybind11 library and can only recommend it. It implements Python's buffer protocol, which lets you interact with Python data with almost no overhead.

I've answered a question explaining how to pass a numpy array from Python to C/C++, do something dummy to it in C++ and return a dynamically created array back to Python.

EDIT 2: I've added a simple example that receives a NumPy array, sends it to C and prints it from C. You can find it here. Hope it helps!

EDIT 3: To answer your last comment, yes, you can definitely do that. You could modify your code to (1) instantiate a 2D numpy array in C++, (2) pass a pointer to its data to your C function, which will fill it instead of declaring a matrix, and (3) return that array to Python by reference.

Your function would become:

void ndarray_to_matrix(unsigned char* src, double* x, long* shape, long* strides)
{
    int nb = shape[0];
    int h = shape[1];
    int w = shape[2];
    int c = shape[3];

    int step_b = strides[0];
    int step_h = strides[1];
    int step_w = strides[2];
    int step_c = strides[3];

    int b, i, j, k;
    int index1, index2 = 0;

    for(b = 0; b < nb; ++b) {
        for(i = 0; i < h; ++i) {
            for(k = 0; k < c; ++k) {
                for(j = 0; j < w; ++j) {
                    index1 = k*w*h + i*w + j;
                    index2 = step_b*b + step_h*i + step_w*j + step_c*k;
                    /* write into the flat output buffer instead of a matrix */
                    x[b*h*w*c + index1] = src[index2]/255.;
                }
            }
        }
    }
}

And you'd add, in your C++ wrapper code

// Instantiate the output array, assuming we know b, h, c, w
py::array_t<double> x = py::array_t<double>(b*h*c*w);
py::buffer_info bufx = x.request();
double* ptrx = (double*) bufx.ptr;

// Call your C function with ptrx as the output buffer
// (src, shape and strides come from the input array's buffer_info)
ndarray_to_matrix(src, ptrx, shape, strides);

// now reshape x to (batch, h*c*w)
x.reshape({b, h*c*w});

Do not forget to modify the prototype of the C++ wrapper function to return a numpy array like:

py::array_t<double> read_matrix(...) { ... }

This should work, I didn't test it though :)
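
On the Python side, calling the compiled module could then look roughly like this. This is only a sketch: the module name detector_ext and the exact signature of read_matrix are assumptions, not part of the example repository.

import numpy as np
import detector_ext   # hypothetical name of the compiled pybind11 module

# batch of 16 BGR images, shape (16, 360, 640, 3), dtype uint8
batch = np.zeros((16, 360, 640, 3), dtype=np.uint8)

# read_matrix receives the batch through the buffer protocol (no ctypes
# pointer juggling) and returns the (16, 691200) float64 matrix built in C/C++
X = detector_ext.read_matrix(batch)
print(X.shape, X.dtype)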

Christian
  • It seems like pybind only works with C++ (according to the website)? Also, I know that the C++ version of imread() stores the data as a contiguous array, but Python seems to do it differently. – HighVoltage Apr 30 '18 at 14:22
  • I'm using C code with PyBind, but you're right that the wrapper itself has to be C++. I include my C modules with `extern "C"`. If I have time later I'll set up a quick example that hopefully could be useful for you. – Christian Apr 30 '18 at 14:24
  • Could you check out the code I've put on github? (see EDIT 2 in my answer above) – Christian May 01 '18 at 06:19
  • Thank you for the code! I will test it and let you know how it works in terms of performance – HighVoltage May 02 '18 at 00:26
  • You're welcome :) I'm very curious to see how it's going to work (you will be interested in the folder "example1" of the repo, that's the one including C code). – Christian May 02 '18 at 06:24
  • Just tested your code with a 4D array like in my case, and it works! I am curious about how I can integrate this into my existing setup so I can compare the timings: as you can see in my original code, I rearrange this 4D data into a 2D matrix and return a pointer to it back into Python. Can I achieve something similar through pybind? – HighVoltage May 03 '18 at 01:40
  • I just timed the whole process, and the pybind implementation is taking 170 ms whereas the ctypes one takes 35 ms for the matrix conversion. Do you think I might be doing something wrong? – HighVoltage May 03 '18 at 17:11
  • Mmmh, I'm not too sure. In the timing process, do you include the time needed to pass data from Python to C/C++ and back? Could you maybe share the code? – Christian May 03 '18 at 21:21
  • Yes, I do, but same thing in ctypes. I just measure the time elapsed for the outermost python function. My code can be seen at the fork https://github.com/saihv/pybind_examples . Thanks again – HighVoltage May 03 '18 at 21:56
  • You're most welcome, I'm also learning here :) I'll have a look at your code and see if I understand why it's this slow. – Christian May 04 '18 at 06:32
  • Would there be a way for me to see your `ctypes` code? – Christian May 07 '18 at 11:51