5

Currently I'm working on video processing software in which the picture data (8-bit signed and unsigned) is stored in arrays of 16-byte-aligned integers, allocated as

__declspec(align(16)) int *pData = (__declspec(align(16)) int *)_mm_malloc(width*height*sizeof(int),16);

In general, wouldn't reading and writing be faster if one used signed/unsigned char arrays like this?

__declspec(align(16)) int *pData = (__declspec(align(16)) unsigned char *)_mm_malloc(width*height*sizeof(unsigned char),16);

I know little about cache line size and data transfer optimization, but at least I know that it is an issue. Beyond that, SSE will be used in the future, and in that case char arrays, unlike int arrays, are already in a packed format. So which version would be faster?

Ashwin Nanjappa
  • 2
    Why don't you benchmark it? There are a ton of things that work in theory but behave differently in practice. Test it and see for yourself which is faster. Nobody who has a clue about performance can answer this question in the abstract; the environment is what decides the final performance. – Pop Catalin Sep 26 '08 at 08:16
  • What we think doesn't really matter. You need to run a benchmark to get your answer. – Andy Brice Sep 26 '08 at 08:19
  • Both versions will run at the same speed. It's the "final" type that matters, and it's still int*, so the compiler will treat it as such. Plus, in the second version you may have a problem with buffer overruns (you allocated 4x less memory than in the first version; is it enough?) – yrp Sep 26 '08 at 08:24
  • I have removed the C++ tag from your question since it has nothing to do with that language. – Ashwin Nanjappa Sep 26 '08 at 09:00
  • 5
    In this case, I really hate the "just benchmark it!" answers. Yes, that's the ultimate answer, but it's also incredibly unhelpful if that's the only answer you can get around here for questions of speed. Dark Shikari's response, however, gives a lot more. – Anthony Nov 02 '09 at 07:51

4 Answers

5

If you're planning to use SSE, storing the data in its native size (8-bit) is almost certainly the better choice, since loads of operations can be done without unpacking, and even if you need to unpack for pmaddwd or other similar instructions, it's still faster because you have to load less data.

Even in scalar code, loading 8-bit or 16-bit values is no slower than loading 32-bit, since movzx/movsx is no different in speed from mov. So you just save memory, which surely can't hurt.
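As a rough illustration of the "no unpacking" point, here is a minimal sketch of an operation done directly on packed 8-bit pixels with SSE2. The helper name `brighten_u8` and the choice of a saturating brightness add are assumptions made for the example, not something from the question; it also assumes the buffer is 16-byte aligned and its length is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add `delta` to every unsigned 8-bit pixel,
   clamping at 255. Assumes `pixels` is 16-byte aligned and `count`
   is a multiple of 16. */
void brighten_u8(uint8_t *pixels, size_t count, uint8_t delta)
{
    assert(((uintptr_t)pixels & 15) == 0);
    assert(count % 16 == 0);

    const __m128i d = _mm_set1_epi8((char)delta);  /* delta in all 16 lanes */
    for (size_t i = 0; i < count; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, d);  /* 16 pixels per instruction, saturating */
        _mm_store_si128((__m128i *)(pixels + i), v);
    }
}
```

With one pixel per 32-bit int, the same 128-bit register would hold only 4 pixels, so a loop like this would need four times as many loads and stores for the same image.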

Dark Shikari
1

It really depends on your target CPU: you should read up on its specs and run some benchmarks, as everyone has already suggested. Many factors could influence performance. The first obvious one that comes to mind is that your array of ints is 2 to 4 times larger than an array of chars (4 times with a typical 32-bit int) and, hence, if the array is big enough, you'll get more data cache misses, which will slow things down.
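To put numbers on the size difference, the following sketch computes the memory footprint of one frame and how many cache lines it spans. The helper names, the 1920x1080 frame size used in the note below, and the 64-byte cache line are all assumptions for illustration; check the actual line size of your target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for one width x height frame. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Hypothetical helper: number of cache lines a buffer of `bytes` bytes
   spans, assuming it starts on a line boundary. */
size_t frame_cache_lines(size_t bytes, size_t line_size)
{
    assert(line_size > 0);  /* guard against division by zero */
    return (bytes + line_size - 1) / line_size;
}
```

For a 1920x1080 frame, `frame_bytes(1920, 1080, 1)` gives 2,073,600 bytes, while 4 bytes per pixel gives 8,294,400 bytes; with 64-byte lines that is 32,400 versus 129,600 cache lines to stream through for every full pass over the image.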

Alexander
-1

On the contrary, packing and unpacking cost CPU instructions.

If you want to do a lot of random pixel operations, it is faster to use an array of int so that each pixel has its own address.

But if you iterate through your image sequentially, you want a char array, so that it is small and reduces the chance of a page fault (especially for large images).

akiva
  • Every `char` has its own address. Semi-related: [Can modern x86 hardware not store a single byte to memory?](https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory) Answer: yes it can, people claiming that CPUs can only do whole-word loads/stores are wrong. – Peter Cordes Nov 22 '17 at 11:09
-1

Char arrays can be slower in some cases. As a very general rule of thumb, the native word size is the best to go for, which will more than likely be 4 bytes (32-bit) or 8 bytes (64-bit). Even better is to have everything aligned to 16 bytes, as you have already done; this enables faster copies if you use SSE instructions (such as the non-temporal store MOVNTDQ). If you are only concerned with moving items around, this will have a much greater impact than the type used by the array.
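For reference, a 16-byte-aligned copy with SSE2 might look roughly like the sketch below. The helper name `copy_aligned16` is hypothetical, and the code assumes both pointers are 16-byte aligned and the byte count is a multiple of 16. Whether non-temporal stores actually help depends on buffer size and on whether the destination is read again soon, so treat this as something to benchmark rather than a guaranteed win.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `bytes` bytes from src to dst.
   Assumes both pointers are 16-byte aligned and `bytes` is a
   multiple of 16. */
void copy_aligned16(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert(bytes % 16 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i) {
        /* aligned load, then non-temporal (cache-bypassing) store */
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

Note that the copy works on raw bytes, so it is indifferent to whether the buffer holds int or char elements; a char buffer simply has a quarter of the bytes to move for the same number of pixels.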

jheriko
  • [Can modern x86 hardware not store a single byte to memory?](https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory) Answer: yes it can, and highly efficiently. So can other modern architectures: everything has load byte (with zero extension) and store byte, except early versions of DEC Alpha AXP, which famously lacked byte load/store instructions. – Peter Cordes Nov 22 '17 at 11:12