I was reading this book called The Shellcoder's Handbook and then I came across this information:

A modern computer makes no real distinction between instructions and data. If a processor can be fed instructions when it should be seeing data, it will happily go about executing the passed instructions. This characteristic makes system exploitation possible.

That's when I realised how vulnerable our modern systems are :( But this book is a bit old, so I did some research, but I couldn't find enough information to satisfy me. Is there still no method that can distinguish between data and instructions?

Side question: a typical assembly file contains three sections: .text, .bss and .data

Is there another way to distinguish between these ?? [Without Metadata]

Peter Cordes
Da3kL10rd
  • It is not possible; bits are bits. Can the train wheels tell whether the next section of track is made of steel or something else, or whether it is even present? Certainly not: a wheel won't know whether the track is in line and holds until the moment it touches it. Same with a processor: it cannot tell data from instructions until the bits are loaded into the decoder and a decode is attempted. A buggy program may actually put data in front of the processor, and it will do its best to consume those bits as instructions because it cannot tell. Processors are very, very, very dumb. – old_timer Dec 30 '21 at 19:04
  • The only way to properly separate instructions from data is first to assume that the programmer and tools have built a proper program that conforms to the architecture and its rules. Second, you have to start from a known entry point and then decode in execution order. Because some of the instruction paths may depend on user input etc., you might not be able to follow all of them, and you cannot simply assume you know the full range of a jump table etc. by inspection. In the old days, or perhaps even today, some branches were bogus to throw off disassembly, causing chaos. – old_timer Dec 30 '21 at 19:06
  • Some binary files and file formats allow for function labels, giving the user/tools more entry points into the code. It is trivial to confuse a disassembler, even GNU objdump, by removing the labels and tossing a byte or two into the path of the disassembler. – old_timer Dec 30 '21 at 19:08
  • "Modern computer" has nothing to do with it; this is how logic works. You feed it bits and hope you did your job right in crafting the right bits, just like you lay down some train tracks and hope you did it right and it does not crash the train. – old_timer Dec 30 '21 at 19:08
  • The question quote seems inverted: did you mean "If a processor can be fed *data* when it should be seeing *instructions* it will happily go about executing the passed *data* [as instructions]"? – Weather Vane Dec 30 '21 at 19:26
  • This question has been asked many times before on Stack Overflow; I closed it as a duplicate of the first 5 ones that looked good in search results. There's also Q&As like [x86 way to tell instruction from data](https://stackoverflow.com/q/4224789), but that's about reverse-engineering / disassembly, not code-injection attacks. And [How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?](https://stackoverflow.com/q/31172253) is also relevant with some good answers. – Peter Cordes Dec 30 '21 at 21:04
  • [How instructions are differentiated from data?](https://stackoverflow.com/q/2022489) is somewhat relevant, but I decided to drop it from the list of duplicates in favour of [How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?](https://stackoverflow.com/q/31172253) – Peter Cordes Dec 30 '21 at 21:11

3 Answers

The short answer is no: unless you tell a processor what it is receiving, it doesn't know the difference. If execution jumps there (and the page table allows exec permission), it's code. If load/store instructions access it as data, it's data. (Both can happen to the same memory address, e.g. in a program that uses JIT compilation to write some machine code into a buffer and then jumps there to run it.)

A run of data bits and a run of instruction bits could be identical, and without full context there is no way of telling them apart.

You can extrapolate that to the extreme; is 01 part of some data or part of an instruction?
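
To make the fetch-versus-load distinction concrete, here is a minimal sketch in Python of a toy accumulator machine. The instruction set (`HALT`, `LOAD`, `ADD`, `JMP`), the two-byte encoding and the `run` function are all invented for illustration and do not model any real CPU. The byte at address 4 is first read as data by a `LOAD` and later executed as the opcode of an `ADD`; nothing in the byte itself says which it is.

```python
# Toy machine, invented for illustration: every instruction is two bytes,
# an opcode followed by an operand that names a memory address.
HALT, LOAD, ADD, JMP = 0x00, 0x01, 0x02, 0x03

memory = bytearray([
    LOAD, 4,   # addr 0: acc = memory[4]   (reads the ADD opcode as data)
    JMP,  4,   # addr 2: pc = 4
    ADD,  0,   # addr 4: acc += memory[0]  (memory[0] is the LOAD opcode)
    HALT, 0,   # addr 6: stop
])

def run(mem, max_steps=100):
    """Execute from address 0; return (accumulator, list of fetched addresses)."""
    acc, pc, fetched = 0, 0, []
    for _ in range(max_steps):
        opcode, operand = mem[pc], mem[pc + 1]
        fetched.append(pc)
        if opcode == HALT:
            break
        if opcode == LOAD:
            acc = mem[operand]            # the byte is used as data
            pc += 2
        elif opcode == ADD:
            acc = (acc + mem[operand]) & 0xFF
            pc += 2
        elif opcode == JMP:
            pc = operand                  # whatever is there gets decoded next
        else:
            raise ValueError(f"undefined opcode {opcode:#04x} at address {pc}")
    return acc, fetched

acc, fetched = run(memory)
print(acc, fetched)
```

Address 4 appears both as the operand of the `LOAD` and in the list of fetched addresses: its role comes from how the running program reaches it, not from its contents.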

Peter Cordes
yuuuu

Instructions are data. Before a program runs, the build tools and operating system treat the program as data, e.g. as a data file. In order to load the program into memory, the OS treats the various parts of the program as data, including the .text section (the one containing program code / machine code instructions).

For 8-bit data bytes, all possible bit patterns (all possible values) are used to convey meaning: e.g. they differentiate between 256 different symbols in an 8-bit extension of ASCII, or represent numbers from 0..255 or -128..127, or represent parts of some larger word encoding. Since instructions are data, there's no inherent way for a processor or anyone else to differentiate between them.
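
As a small illustration of one byte carrying several meanings, the Python sketch below reads the same 8-bit value as an unsigned number, a signed number, an ASCII character and a one-byte 32-bit x86 instruction. The `interpretations` helper and the three-entry `X86_ONE_BYTE` table are made up for this example; the table is a tiny hand-picked subset, not a real decoder.

```python
import struct

# Hand-picked subset of one-byte opcodes in 32-bit x86 (illustrative only).
X86_ONE_BYTE = {0x50: "push eax", 0x90: "nop", 0xC3: "ret"}

def interpretations(value):
    """Return several equally valid readings of a single byte value."""
    raw = bytes([value])
    return {
        "unsigned": struct.unpack("B", raw)[0],   # 0..255
        "signed": struct.unpack("b", raw)[0],     # -128..127 (two's complement)
        "ascii": raw.decode("ascii") if value < 0x80 else None,
        "x86": X86_ONE_BYTE.get(value),           # None if not in the subset
    }

for value in (0x50, 0xC3):
    print(hex(value), interpretations(value))
```

The value `0x50` is the number 80, the letter `P` and `push eax` all at once, while `0xC3` is 195, -61 and `ret`. Which reading applies depends entirely on what the surrounding program does with the byte.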

That being said, modern operating systems and processors allow sections such as .text to be protected so that writes to them are prevented.

"Is there another way to distinguish between these ?? [Without Metadata]"

No, metadata is the approach.

"If a processor can be fed instructions when it should be seeing data, it will happily go about executing the passed instructions."

This is cumbersome wording. I don't know what it means to say "it should be seeing data". I would instead say: when the program is fooled into executing what it intended as data.


But modern processors and operating systems have techniques that thwart this.  Marking the runtime/call stack as non-executable is one such approach.  The idea is to carry some of that program metadata over into the runtime using hardware features that effectively tag some memory regions with execute permission (or not).  Tagging regions or pages is more efficient than tagging individual machine code instructions or individual data bytes.
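
A rough sketch of the per-page tagging idea, in Python. The page size, the contents of `PAGE_TABLE` and the `allowed` helper are assumptions made up for illustration; real hardware performs this check in the MMU on every access, and the OS fills in the real page tables.

```python
PAGE_SIZE = 4096

# Hypothetical page table: page number -> permissions (read, write, execute).
PAGE_TABLE = {
    0: "r-x",  # code, like .text: executable but not writable
    1: "rw-",  # globals, like .data/.bss: writable but not executable
    2: "rw-",  # stack: writable but not executable
}

# Which permission each kind of access needs.
NEEDED = {"fetch": "x", "load": "r", "store": "w"}

def allowed(access, addr):
    """Return True if this kind of access to addr is permitted."""
    perms = PAGE_TABLE.get(addr // PAGE_SIZE, "---")  # unmapped page: nothing allowed
    return NEEDED[access] in perms

stack_addr = 2 * PAGE_SIZE + 0x40
print(allowed("store", stack_addr), allowed("fetch", stack_addr))
```

In this model an attacker can still write bytes into the stack page, but an instruction fetch from it is refused, whatever those bytes happen to be. One permission string covers all 4096 bytes of a page, which is the efficiency point made above.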

See more here: https://en.wikipedia.org/wiki/Buffer_overflow_protection

Erik Eidt

At the machine level, code and data can be distinguished by the execute-permission bit in a page table. The determining factor is not any bit pattern but the (virtual) address.

Statistically speaking, it's possible to determine with reasonable confidence whether a byte stream can be interpreted as meaningful code. Code has structure, such as data-dependency flow, that arbitrary or repetitive data lacks. Random data could decode as `mov eax, ebx`, but typical code wouldn't contain many occurrences of `mov eax, ebx; mov eax, ecx; mov eax, edx`, which would be a perfectly valid sequence of data.

Aki Suihkonen
  • There can be data in executable pages, e.g. a program compiled with `-zexecstack` (or with that implicitly enabled because you took a function pointer to a GNU C nested function). In ARM, it's normal for read-only "literal pools" of data to go between functions because ARM has a relatively short-ranged PC-relative load instruction, and toolchains evolved to take advantage of it that way. So an EXEC bit in a page table doesn't guarantee that bytes in the page are code, although lack of such a bit would rule out them executing as code at the moment, from this mapping. – Peter Cordes Dec 30 '21 at 20:48
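
A comment on the question suggests the practical way to separate code from data mixed into the same region: start from a known entry point and decode in execution order. Below is a minimal Python sketch of that idea on a toy instruction set. The opcodes, the two-byte encoding, the `image` bytes and the `reachable_code` function are all invented for illustration; this is not a real disassembler. The two bytes at address 4 play the role of a literal pool that happens to look like a valid jump.

```python
# Toy ISA, invented for illustration: two bytes per instruction (opcode, operand).
HALT, LOAD, ADD, JMP, JZ = 0x00, 0x01, 0x02, 0x03, 0x04
KNOWN = {HALT, LOAD, ADD, JMP, JZ}

image = bytes([
    LOAD, 4,     # addr 0: load the literal stored at address 4
    JMP,  6,     # addr 2: jump over the literal
    0x03, 0x00,  # addr 4: DATA, although the bytes are identical to "JMP 0"
    ADD,  5,     # addr 6
    HALT, 0,     # addr 8
])

def reachable_code(mem, entry=0):
    """Follow control flow from entry; return the set of byte addresses decoded as code."""
    starts, worklist = set(), [entry]
    while worklist:
        pc = worklist.pop()
        if pc in starts or pc + 1 >= len(mem) or mem[pc] not in KNOWN:
            continue                      # already visited, out of range, or undecodable
        starts.add(pc)
        opcode, operand = mem[pc], mem[pc + 1]
        if opcode == HALT:
            continue                      # execution stops here
        if opcode in (JMP, JZ):
            worklist.append(operand)      # branch target
        if opcode != JMP:
            worklist.append(pc + 2)       # fall through to the next instruction
    return {b for s in starts for b in (s, s + 1)}

code_bytes = reachable_code(image)
data_bytes = set(range(len(image))) - code_bytes
print(sorted(code_bytes), sorted(data_bytes))
```

A decoder that simply walked the image two bytes at a time would report a jump at address 4. Following execution order never reaches it, so it is classified as data. As the same comment warns, this only works while every branch target is visible: the toy has no indirect jumps or jump tables, which are exactly the cases where the approach breaks down.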