I checked and could find numerous posts about the performance of float vs. double (here is one, and here is another). In most cases, it is said that they have the same performance because the FPU converts both to 10-byte extended-precision reals. But I'm still not convinced. What if locality issues are taken into account? Consider doing a bitwise XOR on a large number of values and then counting the nonzero bits: this should take considerably less time when the data fits in the cache, which it will for float at half the size of double. Doing the XOR and bit population count with regular (non-SIMD) instructions makes processing take a lot longer. I tried to write a test to confirm this, but it is not easy to get everything right.
One question is: are these two types converted to the same size when they sit in the cache?
In general, I was wondering if anyone can characterize the behavior of these two choices in different situations.