I'm guessing you are just looking for understanding and not battling a real performance issue. The difference most likely wouldn't show up under measurement, and here's why:
Normally, whenever a processor with a memory cache (i.e. most of today's desktop CPUs) has to write a value to memory, the cache line that contains the address must first be read from (relatively slow) RAM, unless it is already in the cache. The CPU then modifies the value in the cache, and the entire cache line is eventually written back to main RAM.
When you perform operations over a range of contiguous addresses, like your array, the CPU can perform several operations very quickly on one cache line before it is written back. It then moves on to the next cache line, which has typically already been fetched in anticipation.
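To put rough numbers on that, here is a minimal sketch of how many array elements share one cache line. The 64-byte line size is an assumption (it is common on current desktop CPUs but not guaranteed), and the names `ASSUMED_CACHE_LINE_BYTES`, `elements_per_line` and `lines_touched` are made up for illustration.

```c
#include <stddef.h>

/* Assumed cache line size in bytes. 64 is common on current desktop
   CPUs, but it is an assumption here, not a guarantee. */
#define ASSUMED_CACHE_LINE_BYTES 64

/* Number of elements of the given size that fit in one cache line.
   element_size must be nonzero. */
size_t elements_per_line(size_t element_size)
{
    return ASSUMED_CACHE_LINE_BYTES / element_size;
}

/* Number of distinct cache lines touched by one sequential pass over
   `count` elements, assuming the array starts on a line boundary. */
size_t lines_touched(size_t count, size_t element_size)
{
    size_t bytes = count * element_size;
    return (bytes + ASSUMED_CACHE_LINE_BYTES - 1) / ASSUMED_CACHE_LINE_BYTES;
}
```

With 4-byte elements, that works out to 16 elements per line under this assumption, so only about one access in sixteen can even potentially wait on RAM; the rest hit a line that is already in the cache.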
Most likely, performing the test before writing the value will not be measurably different from just writing, for several reasons:
- Branch prediction makes the test extremely cheap when its outcome is predictable.
- The compiler will probably have applied some powerful optimizations to the loop.
- The memory transfer between RAM and the cache will be the real rate-determining step.
So just write your code for clarity. Measure the difference if you are still curious.
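If you do want to measure, something along these lines would be a starting point. This is only a sketch: the function names `set_if_different`, `set_always` and `time_pass` are hypothetical, I'm assuming an `int` array since I can't see your exact code, and `clock()` is fairly coarse, so you would want a large array and several repetitions.

```c
#include <stddef.h>
#include <time.h>

/* Variant A: test first, write only when the value differs. */
void set_if_different(int *a, size_t n, int value)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != value)
            a[i] = value;
    }
}

/* Variant B: write unconditionally. */
void set_always(int *a, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        a[i] = value;
}

/* CPU time in seconds used by one call of fn over the array.
   clock() has limited resolution, so short runs may report 0. */
double time_pass(void (*fn)(int *, size_t, int), int *a, size_t n, int value)
{
    clock_t start = clock();
    fn(a, n, value);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call `time_pass(set_if_different, a, n, v)` and `time_pass(set_always, a, n, v)` on the same large array, several times each and in alternating order, and compare. Keep in mind that an optimizing compiler may rewrite either loop, so check that the results are actually used afterwards and don't read too much into small differences.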