By curiosity, I compared the last bit set in the significand (or in other words the number of trailing zeros in the significand)
I did it with Squeak Smalltalk because I'm more comfortable with it, but I'm pretty sure you can find equivalent libraries in Python:
CPU:
#(980998.375000 749440.750000 536876.875000 334921.250000 244632.437500)
collect: [:e | e asTrueFraction numerator highBit].
-> #(23 22 23 21 22)
GPU:
#(980998.3125 749440.812500 536876.812500 334921.281250 244632.453125)
collect: [:e | e asTrueFraction numerator highBit].
-> #(24 24 24 24 24)
That's interestingly not as random as we could expect, especially the GPU, but there is not enough clue at this stage...
Then I used an ArbitraryPrecisionFloat package to perform (emulate) the operations in extended precision, then round to nearest single precision float, the correct answer matches quite exactly the one of CPU:
#( 86312 67405 49652 32261 24232 ) collect: [:e |
| u r |
u := e asArbitraryPrecisionFloatNumBits: 80.
r = u*u ln.
(r asArbitraryPrecisionFloatNumBits: 24) asTrueFraction printShowingMaxDecimalPlaces: 100]
-> #('980998.375' '749440.75' '536876.875' '334921.25' '244632.4375')
It works as well with 64 bits.
But if I emulate the operations in single precision, then I can say the GPU matches the emulated results quite well too (except the second item):
#( 86312 67405 49652 32261 24232 ) collect: [:e |
| u r |
u := e asArbitraryPrecisionFloatNumBits: 24.
r = u*u ln.
r asTrueFraction printShowingMaxDecimalPlaces: 100]
-> #('980998.3125' '749440.75' '536876.8125' '334921.28125' '244632.453125')
So I'd say the CPU did probably use a double (or extended) precision to evaluate the log and perform the multiplication.
On the other side, the GPU did perform all the operations in single precision. Then the log function of ArbitraryPrecisionFloat package is correct to half ulp, but that's not a requirement of IEEE 754, so that can explain the observed mismatch on second item.
You may try to write the code so as to force float (like using logf instead of log if it's C99, or use intermediate results float ln=log(u); float r=u*ln;) and eventually use appropriate compilation flags to forbid extended precision (can't remember, I don't use C every day). But then you have very few guaranty to obtain 100% match on log function, the norms are too lax.