The compiler guarantees that the code executes correctly, but it may sacrifice performance to do so. The union only adds syntactic convenience for accessing the individual elements of a 4-float vector, while the __m128 object is (conceptually, if not actually) sitting in a register. I recommend using the __m128 object directly and moving data in and out of it with the _mm_load_ps and _mm_store_ps family of intrinsics.
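As a minimal sketch of working with the vector type directly, the snippet below loads two float arrays into registers, operates on them, and stores the result, with no union involved. The function name scale_and_add and its parameters are hypothetical, chosen only for illustration. Note that _mm_load_ps and _mm_store_ps expect 16-byte aligned pointers; _mm_loadu_ps and _mm_storeu_ps are the unaligned variants.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_store_ps, ... */

/* Hypothetical helper: computes out[i] = a[i] * k + b[i] for i in 0..3.
 * All three pointers are assumed to be 16-byte aligned. */
void scale_and_add(const float *a, const float *b, float k, float *out)
{
    __m128 va = _mm_load_ps(a);       /* aligned load of 4 floats */
    __m128 vb = _mm_load_ps(b);
    __m128 vk = _mm_set1_ps(k);       /* broadcast k to all 4 lanes */
    __m128 r  = _mm_add_ps(_mm_mul_ps(va, vk), vb);
    _mm_store_ps(out, r);             /* aligned store back to memory */
}
```

The caller reads individual elements from the plain float array after the store, which is the access the union was meant to provide.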
Comments in the link you supplied suggest that the compiler can optimize poorly around the union, especially when it contains __m128 members. To be sure, benchmark the code both with and without the union. For high-resolution time measurement on Linux I recommend the pthread_getcpuclockid and clock_gettime APIs. Post your results if you can!
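A rough sketch of such a measurement is below. For simplicity it passes CLOCK_THREAD_CPUTIME_ID to clock_gettime, which is the CPU-time clock of the calling thread; pthread_getcpuclockid would be the way to obtain the equivalent clock ID for some other thread. The names thread_cpu_seconds, time_kernel and sample_kernel are hypothetical, and sample_kernel is only a placeholder for the union and non-union versions of your own code.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: CPU time consumed so far by the calling thread,
 * in seconds, or -1.0 if the clock cannot be read. */
double thread_cpu_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical harness: runs fn() `iterations` times and returns the
 * CPU seconds spent doing so. */
double time_kernel(void (*fn)(void), long iterations)
{
    double start = thread_cpu_seconds();
    for (long i = 0; i < iterations; ++i)
        fn();
    return thread_cpu_seconds() - start;
}

/* Placeholder workload; replace with the code under test. The volatile
 * sink keeps the compiler from optimizing the work away. */
static volatile float sink;
void sample_kernel(void)
{
    sink = sink + 1.0f;
}
```

Run each variant for enough iterations that the elapsed time is well above the clock resolution, and repeat the measurement a few times before comparing.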
In general, for best performance, make things as simple for the compiler as possible. That means keeping performance-critical types like __m128 out of compound types like unions and instead declaring them on the stack or in memory allocated expressly for them.