
If I perform a computation written in C, something like matrix-matrix addition or matrix-matrix multiplication, where the work is done in a for loop with the same number and type of arithmetic operations happening on each iteration, could the specific values of the input data impact the speed of the computation? For example, if the matrix elements are 32-bit integers with values between 0 and 127, so that their representations vary in only one of their four bytes, would that run faster than if the values varied between 0 and INT_MAX (supposing no undefined behavior occurs)?

If I had to guess I would say no: whether the value is small or very large, it is defined as a 32-bit integer, so it occupies the same amount of memory space either way, and small and large values end up taking the same number of bytes in memory. Am I thinking about this correctly? Is it different for single/double precision floating point numbers?
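For concreteness, here is a minimal sketch of the kind of loop I have in mind. The matrix size, the value ranges, and the naive `clock()` timing are purely illustrative, not meant as a rigorous benchmark:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024  /* illustrative matrix dimension */

/* Element-wise addition: the same number and type of operations per
 * iteration, regardless of the values stored in a and b. */
static void add_matrices(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = a[i] + b[i];
}

static double time_add(const int32_t *a, const int32_t *b, int32_t *c)
{
    clock_t t0 = clock();
    add_matrices(a, b, c, N);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    size_t total = (size_t)N * N;
    int32_t *a = malloc(total * sizeof *a);
    int32_t *b = malloc(total * sizeof *b);
    int32_t *c = malloc(total * sizeof *c);
    if (!a || !b || !c)
        return 1;

    /* Case 1: "small" values, 0..127: only the low byte of each element varies. */
    for (size_t i = 0; i < total; i++) { a[i] = rand() % 128; b[i] = rand() % 128; }
    double small = time_add(a, b, c);

    /* Case 2: "large" values, halved so a[i] + b[i] cannot overflow (no UB). */
    for (size_t i = 0; i < total; i++) { a[i] = rand() / 2; b[i] = rand() / 2; }
    double large = time_add(a, b, c);

    /* Use the result so the compiler cannot discard the work entirely. */
    printf("c[0]=%d  small: %f s  large: %f s\n", c[0], small, large);

    free(a); free(b); free(c);
    return 0;
}
```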

John Bollinger
Shaun Holtzman
  • It is not clear what your problem is. If you want to benchmark, read how to do it properly. If you have other problems, you should be more specific. – too honest for this site Dec 14 '16 at 21:32
  • On modern CPU hardware, the only operations with data-dependent performance are (int and FP) division and sqrt. (They are often variable latency, and not fully pipelined; see [the recent Collatz-conjecture hand-written asm question for more about DIV vs. shifts, for example](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466)). add/sub/mul/shift are all fixed latency/throughput. This is definitely true on all current [x86 CPUs](http://agner.org/optimize), and almost certainly true on most other systems. – Peter Cordes Dec 14 '16 at 21:33
  • @Olaf I don't have a specific problem, I'm just curious. I took off the benchmarking tag in case it made it seem like I have a specific benchmarking question. I figured a kernel benchmark would be a good example to illustrate what I was trying to ask. – Shaun Holtzman Dec 14 '16 at 21:35
  • @PeterCordes has it right. On most modern architectures, *type* certainly matters, but not the *value* for most of the basic ALU operations. – DBPriGuy Dec 14 '16 at 21:36
  • @DBPriGuy: For e.g. division it does matter a lot. Worse: on high-end CPUs the value can matter even for single-cycle operations: a lot of switching nodes in the CPU generates more heat, which can result in a reduced clock for e.g. slim-line notebooks or mobile phones. Or even for normal desktops if they use turbo-speed modes. – too honest for this site Dec 14 '16 at 21:37
  • @PeterCordes: Does that include operations on NaNs, infinities, and subnormals? – user2357112 Dec 14 '16 at 21:38
  • @Olaf: I agree, which is why I said *most*. – DBPriGuy Dec 14 '16 at 21:39
  • @DBPriGuy: Which architectures other than Core-i, ARMv7A/v8A, current AMD (and most probably Zen), POWER or GPUs from Nvidia and AMD do you consider "most modern"? – too honest for this site Dec 14 '16 at 21:40
  • @Olaf: Sigh, we seem to be having a hard time understanding each other. I was referring to most operations, I apologize for my lack of clarity. Essentially, I am expressing agreement with Peter Cordes. – DBPriGuy Dec 14 '16 at 21:44
  • @DBPriGuy: I refer to **all** operations. Without any offence: this is about digital electronics, CMOS technology, power management and dynamic clocking; you need very deep knowledge about this to understand the problems. Things have become very complicated since CPUs change their clock frequency extremely dynamically with power consumption and die temperature. Some even change it depending on which physical region of the die heats up, and can do so from one clock cycle to the next. The clock can change by 50 or more percent. – too honest for this site Dec 14 '16 at 21:52
  • Even if the per-operation cost does not vary with the operand values, if the results of the operations are used in a branch condition, then always having the same operand values will cause the branch to always go the same way. A CPU that performs dynamic branch prediction to attempt to speed up execution (as all modern PC-class CPUs do) would likely run that faster than if the relevant operand values were random. – John Bollinger Dec 14 '16 at 22:02
  • @JohnBollinger The values of the matrix all being the same was just a way for me to sort of simplify the question and make it concise. I am really just interested in the amount of time a calculation would take from start to finish. For example: `for(i=0;i…` – Shaun Holtzman Dec 14 '16 at 22:08
  • @ShaunHoltzman, then I have no idea what you're actually asking. As far as I can tell, the effect of the values being the same is not at all incidental to the question -- on the contrary, it seems to be the whole focus. – John Bollinger Dec 14 '16 at 22:13
  • @JohnBollinger What I'm trying to get at is if I'm, for example, adding up two 32-bit matrices with large numbers, A[0] = 1,021,000,398 A[1] = 1,278,236,282...+B[0] = 2,102,123,010 B[1] = 1,009,876,148... in a for loop, will that be slower than doing the same for loop with smaller 32-bit numbers such as, A[0] = 3 A[1] = 2 ...etc. + B[0] = 9 B[1] = 6. The numbers are the same data type (32 bit integers) but in one case they are all very large and in the other case they are small. Would the performance differ, or would it be similar since they are all still 32-bit integers in memory? – Shaun Holtzman Dec 14 '16 at 22:30
  • @Olaf: So you are in disagreement with Peter Cordes, then? – DBPriGuy Dec 14 '16 at 22:33
  • @DBPriGuy: No. Peter talks about clock cycles, which is correct for a synchronous CPU (which all major CPUs are). I talk about time. It is not the number of cycles that varies, but the length of each cycle. Note that the OP asks about time (which is reasonable for a program). But there are many other influences in a system. Proper benchmarking is not a beginner's job. "Benchmarks do lie, liars do benchmarks" – this proverb has a very serious background, today far more than when it originated (I have known it since the 1980s; it may well be much older). – too honest for this site Dec 14 '16 at 22:39
  • @Olaf: Ah, I reread everything you wrote and now I understand what you are saying. Thanks for sharing your knowledge! – DBPriGuy Dec 14 '16 at 22:47
  • @ShaunHoltzman, in that case, I have edited your question to more clearly convey what I now understand you to be asking. – John Bollinger Dec 14 '16 at 22:50
  • @JohnBollinger thanks, your edit definitely helps articulate my question better. – Shaun Holtzman Dec 14 '16 at 23:03
  • @JohnBollinger: great point, Shaun should have a look at SO's highest-voted question: [Why is it faster to process a sorted array than an unsorted array?](http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array). – Peter Cordes Dec 15 '16 at 00:08
  • @user2357112: good question. On **x86 with SSE: NaN and Infinity as inputs/outputs: no penalty**. On x86 with x87 legacy FPU: NaN and Infinity: [big slowdown](http://stackoverflow.com/questions/31875464/huge-performance-difference-26x-faster-when-compiling-for-32-and-64-bits/31879376#31879376). **Denormals**: depends on the microarchitecture, but many do trap to microcode to handle it. Intel Sandybridge has fast denormals for add/subtract, but not multiply. IIRC, AMD always does denormals in hardware. Of course, setting FTZ (flush to zero) and DAZ (input denormals are zero) avoids these. – Peter Cordes Dec 15 '16 at 00:19
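To make the denormal point from Peter Cordes's last comment concrete: on x86 with SSE, flush-to-zero (FTZ) and denormals-are-zero (DAZ) are controlled through the MXCSR register, reachable from C via the `_MM_SET_FLUSH_ZERO_MODE` and `_MM_SET_DENORMALS_ZERO_MODE` intrinsics. Below is a minimal sketch with purely illustrative values; timing it with and without the two mode-setting lines shows a difference only on microarchitectures that take the microcode assist for denormals:

```c
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE; also pulls in xmmintrin.h */
#include <stdio.h>

#define N (1 << 22)

static float x[N];

int main(void)
{
    /* Fill with subnormal values: FLT_MIN is about 1.18e-38, so 1e-42f is denormal. */
    for (int i = 0; i < N; i++)
        x[i] = 1e-42f;

    /* Uncomment to treat denormal inputs and results as zero, avoiding the
     * slow path (this changes the numerical results, of course):
     *
     * _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
     * _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
     */

    float acc = 0.0f;
    for (int i = 0; i < N; i++)
        acc += x[i] * 0.5f;   /* denormal input and (initially) denormal result */

    printf("%g\n", acc);      /* keeps the loop from being optimized away */
    return 0;
}
```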

1 Answer


The data contained within a variable will not, by itself, affect performance; a variable is just a storage space. It is when the data in that storage space has to be manipulated in some fashion that you can run into performance differences. Here are a couple of things to consider that could affect performance in this regard.

  1. Depending on the hardware architecture of the CPU (e.g. a hardware FPU vs. a software floating-point implementation, pipelined vs. not pipelined, etc.), performing a CPU-intensive mathematical operation (e.g. division or square root) on different data sets can affect performance.

  2. Another factor that can affect performance is where the variable resides in memory. Is it aligned so that the data can be fetched in a single access, or is it misaligned so that an extra access is needed to retrieve all of the data? In such cases the compiler often provides special keywords or even pragmas that can force variables to be aligned for optimization (see the sketch below).
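As a rough sketch of the alignment point, using the standard C11 `alignas` specifier (particular compilers may also offer their own attributes or pragmas for this):

```c
#include <stdalign.h>   /* alignas (C11) */
#include <stddef.h>     /* offsetof */
#include <stdint.h>
#include <stdio.h>

/* The compiler normally inserts 3 bytes of padding after 'tag' so that
 * 'value' lands on its natural 4-byte boundary. */
struct padded {
    uint8_t  tag;
    uint32_t value;
};

int main(void)
{
    printf("offsetof(value) = %zu, sizeof(struct padded) = %zu\n",
           offsetof(struct padded, value), sizeof(struct padded));

    /* C11 also lets you request stricter alignment explicitly, e.g. to match
     * a cache line: */
    alignas(64) static int32_t matrix[64][64];
    printf("matrix starts at %p (64-byte aligned)\n", (void *)matrix);
    return 0;
}
```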

Best Regards!

Brandon83
  • Normal compilers for normal platforms (like x86) always align integers, floats and double to their natural alignment (e.g. 8 bytes for 8-byte values). You have to specifically use `__attribute__((packed))` or a similar pragma to override this inside a struct. Arranging your data in structs to avoid padding can make the difference between the whole struct fitting in one cache line or not, though, leading to larger-scale cache hit rate effects over multiple structs. – Peter Cordes Dec 15 '16 at 00:05
  • Great point! I work in the embedded world mostly and unfortunately this isn't always the case :(. The alignment needs to be forced by the developer and accounted for as I have mentioned above. – Brandon83 Dec 15 '16 at 00:17
  • There is nothing wrong with expanding on the OP's question and providing extra information which is related to the original question. Don't troll and mark down answers where you have not provided any useful information on the topic or constructive criticism on how to make the answer better. – Brandon83 Dec 15 '16 at 03:18
  • I agree that additional information can be useful, but only if the question is answered too. Stop trying to tell me what to do; anyone is free to vote for any reason. But I actually provided a suggestion to make your answer better: answer the question. – alain Dec 15 '16 at 09:46
  • @alain: I have updated my response to clearly answer the question from feedback provided in the comments section. I have also provided some additional information to consider that could affect performance. Thank you for your input! – Brandon83 Dec 15 '16 at 13:34
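To illustrate the struct-layout point from Peter Cordes's comment above: member ordering determines how much padding the compiler inserts, and `__attribute__((packed))` (a GCC/Clang extension) removes it at the cost of possibly misaligned members. The member sizes here are purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* Poor ordering: each small member is followed by padding so that the
 * larger member after it keeps its natural alignment. */
struct loose {
    uint8_t  a;   /* + 7 bytes padding */
    uint64_t b;
    uint8_t  c;   /* + 7 bytes tail padding */
};

/* Same members, largest first: padding shrinks and more of these structs
 * fit in a single cache line. */
struct tight {
    uint64_t b;
    uint8_t  a;
    uint8_t  c;   /* + 6 bytes tail padding so arrays of the struct stay aligned */
};

/* Forcing packing removes all padding, but members may then be misaligned,
 * which can cost extra work on access (GCC/Clang extension). */
struct packed_ {
    uint8_t  a;
    uint64_t b;
    uint8_t  c;
} __attribute__((packed));

int main(void)
{
    printf("loose: %zu, tight: %zu, packed: %zu bytes\n",
           sizeof(struct loose), sizeof(struct tight), sizeof(struct packed_));
    return 0;
}
```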