I have seen a lot of assembly that uses AVX (all three flavors), and in every case I have seen, the more the code sticks to a single kind of instruction, the better it performs. But take, for example, loading a value into a 32-bit register and then broadcasting it into all lanes for AVX2 or AVX-512: isn't it better/faster to just load a register and then permute?
In particular, is there a penalty for mixing ordinary x86_64 instructions with the AVX family of instructions? I know there is such a penalty for mixing SSE and AVX in general.
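To make the two alternatives concrete, here is a minimal sketch in C intrinsics of what I mean, assuming GCC or Clang on x86-64 with AVX2. The function names are mine and purely illustrative, and the compiler is free to turn either path into different instructions than the comments suggest (it may even emit the same code for both), so this only pins down the semantics, not the performance.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Path A: scalar load into a general-purpose register, then broadcast.
   Assumption: the compiler keeps the value in a GPR first; it may instead
   fold this into a single broadcast-from-memory instruction. */
static __attribute__((target("avx2")))
void broadcast_via_gpr(const int32_t *src, int32_t out[8])
{
    int32_t x = src[0];                      /* scalar 32-bit load */
    __m256i v = _mm256_set1_epi32(x);        /* copy x into all 8 lanes */
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Path B: load a full 256-bit vector, then permute lane 0 into every lane.
   Note this reads 32 bytes from src, not just 4. */
static __attribute__((target("avx2")))
void broadcast_via_permute(const int32_t *src, int32_t out[8])
{
    __m256i v   = _mm256_loadu_si256((const __m256i *)src);
    __m256i idx = _mm256_setzero_si256();    /* every lane selects index 0 */
    __m256i b   = _mm256_permutevar8x32_epi32(v, idx);
    _mm256_storeu_si256((__m256i *)out, b);
}

/* Returns 1 if both paths produce the same vector, 0 if they differ,
   and -1 if the CPU does not report AVX2 support. */
int broadcasts_agree(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;

    const int32_t src[8] = {42, 1, 2, 3, 4, 5, 6, 7};
    int32_t a[8], b[8];
    broadcast_via_gpr(src, a);
    broadcast_via_permute(src, b);
    return memcmp(a, b, sizeof a) == 0 && a[7] == 42;
}
```

Calling `broadcasts_agree()` only checks that the two paths are interchangeable in terms of results. Which one is actually faster is exactly what I am asking about, and looking at the generated assembly (e.g. with `-O2 -S`) would be needed to confirm which instructions each path really uses.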
I know I could test this, but I'd rather poke the knowledge of the masses before I take on that mini-project. I am sure someone already knows this.