The C++ memory model does not give you a free pass for writing the same value.
If two threads write to the same non-atomic object without synchronization, that is a data race, even when both write the same value. A data race means your program has undefined behavior. And undefined behavior occurring anywhere in your program's execution means that the behavior of your program, both before and after the point of undefined behavior, is not restricted by the C++ standard in any way.
A given compiler is free to provide a more permissive memory model. I'm unaware of any that do.
One thing you must understand is that C++ is not an assembler macro language. It doesn't have to produce the naive assembler you imagine in your head. C++ instead tries to make it easy for your compiler to produce assembler, which is a very different thing.
Compilers can and do determine "if X happens, we get undefined behavior; so I'll optimize around the fact that X does not happen" when generating code. In this case, the compiler can prove that no program with defined behavior could ever write to the same val from two different unsynchronized threads.
All of this can happen long before any assembly is generated.
And at the assembly level, some hardware might do funny things with unaligned assignment to multi-byte values. Some hardware could (in theory; I'm unaware of any in practice) raise traps when instructions that claim to be single-thread writes occur in two different cores on the same bytes.
So this is UB in C++. And once you have UB, you have to audit the assembly your compiler produces everywhere the compiler can see the code that touches this. If you do LTO, that means your entire program, or at least everything that calls or interacts with the code that has UB, out to an unclear distance.
Just write defined behavior. Only if this turns out to be a mission-critical performance bottleneck should you spend more effort on optimizing it (first try faster defined behavior, and only if that fails do you even consider UB).
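As a minimal sketch of what "defined behavior" could look like here, assuming the shared object is a plain `int`: make it a `std::atomic<int>` and store with `std::memory_order_relaxed`. The names `val`, `set_val`, and `run_two_writers` are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared object; making it atomic removes the data race.
std::atomic<int> val{0};

// Both threads store the same value. Relaxed ordering asks for atomicity
// only, with no ordering guarantees relative to other memory operations.
void set_val() {
    val.store(1, std::memory_order_relaxed);
}

// Starts two unsynchronized writers, waits for both, and reads the result.
int run_two_writers() {
    std::thread a(set_val);
    std::thread b(set_val);
    a.join();  // join() synchronizes with the completion of each thread
    b.join();
    return val.load(std::memory_order_relaxed);
}
```

On mainstream hardware a relaxed store to an aligned `int` typically compiles to an ordinary store instruction, so this is usually the first "faster defined behavior" to measure before considering anything else.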