For x86, gcc/clang/ICC/MSVC don't mix data with code because it's pointless, like I said in my answer on the linked question. (Not counting immediate data, which would decode as part of an instruction, obviously.) The end of the .text section and the start of the .rodata section might be adjacent inside the TEXT segment, but that's not what you mean.
For non-x86 ELF binaries (e.g. ARM), compilers do mix code and read-only data, so that PC-relative loads can reach their constants using the 12-bit or smaller offsets that fit into a fixed-width load instruction.
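To make the offset limit concrete, here's a rough sketch of how far a PC-relative load can reach. It assumes the classic 32-bit ARM (A32) encoding, where the load has a 12-bit byte offset that can be added or subtracted, and where PC reads as the instruction's address plus 8. The function names and the addresses are made up for illustration.

```python
def a32_ldr_literal_reach(insn_addr):
    """Lowest and highest byte address a PC-relative A32 load at insn_addr can name.

    Assumption: A32 encoding, 12-bit unsigned offset plus an add/subtract bit.
    """
    pc = (insn_addr + 8) & ~3      # A32: PC reads as instruction address + 8
    return pc - 4095, pc + 4095


def can_reach(insn_addr, literal_addr):
    lo, hi = a32_ldr_literal_reach(insn_addr)
    return lo <= literal_addr <= hi


# A constant placed right after the function is reachable...
print(can_reach(0x8000, 0x8040))    # True
# ...but one in a read-only data section far away is not (hypothetical address).
print(can_reach(0x8000, 0x20000))   # False
```

So the constants have to live within about 4 KiB of the load that uses them, which in practice means inside the code section, between functions or after a branch.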
Obfuscated x86 binaries certainly might mix in some data, or just make disassembly hard so it looks like there might be some. Static disassembly is normally easy on compiler-generated code that hasn't been intentionally obfuscated. Anything that confuses disassembly can make it look like possible data. And yes, it's undecidable.
Nowhere in my linked answer did I say that binaries with mixed code + constants don't exist. I only said that normal optimizing compilers don't do it, and that it has no performance advantages. Only anti-reverse-engineering advantages, at a small cost in performance assuming the data is read-only. (Or a very large cost if data is read/write.)
Binary obfuscation is a real thing that people use on commercial software. I'm not at all surprised that you've found binaries in the wild that don't disassemble cleanly. But this is done after compiling, making a new obfuscated binary from compiler output. (Or maybe with compiler plugins? I'm really not sure.) But it's not the compiler proper that's doing it; that's a later step in the build toolchain. People who sell binary-obfuscation software are selling a binary->binary converter, not a compiler, I think.
I've never had any trouble disassembling gcc/clang output on any Linux distro (e.g. stuff in /usr/bin or /usr/lib). Without debug symbols you get huge blocks of instructions, but disassembly doesn't get out of sync with how execution would reach it. Padding between functions is long NOPs that decode sanely after the ret or jmp at the bottom of a function. Or with MSVC, the padding is single-byte int3 instructions that again don't desync decoding of the start of the next function the way 00 00 bytes (add [rax], al) would.
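As an illustration of the padding point, here's a toy linear-sweep length decoder. It's only a sketch: it understands a handful of x86-64 opcodes (REX prefixes, 00 and 89 with a ModRM byte, push/pop, nop, ret, int3), and the function and variable names are made up for this example. A real disassembler handles vastly more.

```python
def modrm_len(code, i):
    """Bytes taken by ModRM + optional SIB + displacement at code[i] (64-bit mode)."""
    mod, rm = code[i] >> 6, code[i] & 7
    n = 1
    if mod != 3 and rm == 4:               # SIB byte follows
        n += 1
        if mod == 0 and (code[i + 1] & 7) == 5:
            n += 4                          # SIB with no base register: disp32
    if mod == 1:
        n += 1                              # disp8
    elif mod == 2 or (mod == 0 and rm == 5):
        n += 4                              # disp32 (RIP-relative when mod == 0)
    return n


def linear_sweep(code):
    """Offsets where a naive start-to-end sweep thinks instructions begin."""
    starts, i = [], 0
    while i < len(code):
        starts.append(i)
        j = i
        while j < len(code) and 0x40 <= code[j] <= 0x4F:   # skip REX prefixes
            j += 1
        if j >= len(code):
            break
        op, j = code[j], j + 1
        if op in (0x00, 0x89):              # add r/m8, r8  /  mov r/m, r
            try:
                j += modrm_len(code, j)
            except IndexError:              # truncated instruction at end of buffer
                break
        elif not (op in (0x90, 0xC3, 0xCC) or 0x50 <= op <= 0x5F):
            raise ValueError(f"opcode {op:#04x} not handled by this toy decoder")
        i = j
    return starts


func = "55 48 89 e5 5d c3"                  # push rbp; mov rbp,rsp; pop rbp; ret
int3_pad = bytes.fromhex("c3 cc cc cc " + func)
zero_pad = bytes.fromhex("c3 00 00 00 " + func)

print(4 in linear_sweep(int3_pad))   # True: the sweep lands on the push at offset 4
print(4 in linear_sweep(zero_pad))   # False: the last 00 takes 55 as its ModRM byte
```

With three 00 bytes, the third one decodes as add [rbp+0x48], dl, eating the first two bytes of the next function, so the sweep resumes mid-instruction. Whether zero padding desyncs depends on how many bytes there are (an even count happens to stay in sync here); single-byte int3 can't swallow anything no matter how many there are.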
Notice the difference between your claim (that obfuscated binaries exist) vs. the much much stronger claim made in the paper linked from the other question (that optimizing compilers do this aggressively for performance reasons including on x86).
If you want to implement binary-rewriting that must work for every binary, then yes you have a huge problem. But if you only have to care about non-obfuscated compiler output, it's significantly easier.