1

I have an image represented by a 2D array of chars, I need to perform some operations on this image and store the result in another 2D array, these operations vary from calculating average value of neighbour cells to reordering rows. what are possible optimisations I can do to get better performance? any possible techniques are welcome (e.g locality of references, inline assembly, ...)

I use c on a linux x86_64 machine

PS: I have raw color images with each pixel represented by a group of RGB values.

arash kordi
  • 2,470
  • 1
  • 22
  • 24
  • 2
    It's difficult to come up with one generic solution for all operations. You have to be specific on which operation you want to optimize. – Aziz Nov 05 '12 at 08:36
  • 2
    If your operation is local, and only requires a small rectangle piece of the image, it may be faster to extract a small image, perform your operations on it, and paste the image back at the right place. This may improve the cache access, since on big images, elements on different rows are far away in memory, and therefore not automatically loaded in cache when you access a neighbouring pixel on a different row. – Vincent Nivoliers Nov 05 '12 at 08:39
  • If performance is very important to you (and you are willing to spend hours or days of work for it), you could consider using `OpenCL` if you have a powerful graphics card and run some OpenCL kernel on it. Otherwise, use a recent GCC (e.g. 4.7) and fine-tune optimization flags. – Basile Starynkevitch Nov 05 '12 at 08:52

2 Answers2

2
  • First, allocate 4 bytes per rgb -triplet.
  • Have a pointer for the beginning of each row, then shuffling can be done with pointer exchange
  • plan your API as void process_row(int *out_row, int *in_row, int *row_above, int *row_below);

  • some RGB calculations can be calculated in parallel with normal integer arithmetic.

    • e.g. 00rrrrrrrr 00bbbbbbbb 00gggggggg fits a single 32-bit integer.
      One can add 4 of those without overflow, then simply shift the result right by 2 bits
      and mask with 0011111111 0011111111 0011111111 b
      (this of course requires pre- and postprocessing the data)
    • on 64-bit processor packing of 00 rr 00 gg 00 bb 00 aa is also feasible approach
  • possibly even easier approach is to use __sse intrinsics, from which there is just a small step to...
  • inline assembler
  • there is also a package ORC, which implements portable SIMD.
Community
  • 1
  • 1
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
0

You can use 1D array instead of the 2D array, which will improve the cache hit ratio a lot. This will optimize the program running time.

poet
  • 21
  • 1
  • 3