Low-level details on linking and loading of (PE) programs in Windows

Question


I'm looking for an answer or tutorial that clarifies how a Windows program is linked and loaded into memory after it has been assembled.

In particular, I'm uncertain about the following points:

1. After the program is assembled, some instructions may reference memory within the .DATA section. How are these references translated when the program is loaded into memory starting at some arbitrary address? Do RVAs and relative memory references take care of these issues (the BaseOfCode and BaseOfData RVA fields of the PE header)?
2. Is the program always loaded at the address specified in the ImageBase header field? What if a loaded (DLL) module specifies the same base?

*"Is the program always loaded at the address specified in ImageBase header field? What if a loaded (DLL) module specifies the same base?"* - Doesn't that second question already answer the first? — IInspectable, Aug 07 '16 at 18:11
Code *and* global data reference are relocated when the DLL can't be loaded at its preferred base address. — Hans Passant, Aug 07 '16 at 18:56
@HansPassant - could you point me to a resource on the subject. I can't seem to find anything but very high level descriptions. — Shuzheng, Aug 07 '16 at 19:11
I just gave you a low-level description. I leave it up to Google to let people find resources. — Hans Passant, Aug 07 '16 at 19:16
I tried but can't find something on MSDN or the like. The Wiki is a bit too high level (general) in its description, and the only tutorial I found was for Linux. — Shuzheng, Aug 07 '16 at 19:18
You'll probably want to read the [official PE specification](https://msdn.microsoft.com/en-us/windows/hardware/gg463119.aspx). — 500 - Internal Server Error, Aug 07 '16 at 20:05

Mihai · Answer 1 · 2018-07-16T22:33:14.237

First I'm going to answer your second question: no, a module (whether an exe or a dll) is not always loaded at its preferred base address. This can happen for two reasons: either some other module is already loaded there, so there is no room at the base address contained in the headers, or ASLR (Address Space Layout Randomization) is in effect, which means modules are loaded at randomized addresses for exploit mitigation purposes.

To address the first question (it is related to the second one): a memory location can be referred to in a relative or an absolute way. Jumps and function calls are usually relative (though they can be absolute), which means: "go this many bytes from the current instruction pointer". Regardless of where the module is loaded, relative jumps and calls within it will work.
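As a rough illustration of why relative calls survive relocation, here is a minimal Python sketch that decodes an x86 `call rel32` instruction (opcode 0xE8) placed at two different load addresses. The function name `call_target` and the RVA 0x1000 are made up for the example; only the encoding of the instruction is real.

```python
import struct

def call_target(code: bytes, instruction_address: int) -> int:
    """Return the absolute target of an x86 `call rel32` located at instruction_address."""
    assert code[0] == 0xE8, "expected a call rel32 opcode"
    (rel32,) = struct.unpack_from("<i", code, 1)  # signed 32-bit displacement
    next_instruction = instruction_address + 5    # opcode byte + 4 displacement bytes
    return next_instruction + rel32

# The same five bytes, as if the module were loaded at two different bases.
code = bytes([0xE8]) + struct.pack("<i", 0x100)   # call +0x100
for base in (0x04000000, 0x05000000):
    target = call_target(code, base + 0x1000)     # instruction sits at RVA 0x1000 (assumed)
    print(hex(target - base))                     # prints 0x1105 both times
```

The target's RVA is the same for both bases, so the encoded bytes never need to be patched.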

When it comes to addressing data, references are usually absolute, that is, "access this 4-byte datum at this address". The address encoded in the instruction is a full virtual address (VA), not an RVA.

If a module is not loaded at its preferred base address, every absolute reference is broken: it no longer points to the place the linker assumed. Say the ImageBase is 0x04000000 and you have a variable at RVA 0x000000F4, so its VA is 0x040000F4. Now imagine the module is loaded not at its ImageBase but at 0x05000000. Everything is moved 0x01000000 bytes forward, so the VA of your variable is actually 0x050000F4, but the machine code that accesses the data still has the old address hardcoded, so the program would read the wrong memory. To fix this, the linker records in the executable where these absolute references are (the base relocation table, usually in the .reloc section), so that the loader can patch each one by adding the amount by which the image has been displaced: the delta, i.e. the difference between the address where the image is actually loaded and the image base contained in the headers of the executable file. In this case it is 0x05000000 - 0x04000000 = 0x01000000. This process is called base relocation and is performed by the operating system at load time, before the code starts executing.
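The fix-up step can be sketched in a few lines of Python. This is a toy model, not the real loader: the image is a plain byte buffer, the list of fix-up RVAs is written by hand instead of being parsed from a relocation table, and `apply_relocations` is a name invented for the example. It patches 32-bit absolute addresses, which is what a HIGHLOW-type relocation entry does on x86.

```python
import struct

PREFERRED_BASE = 0x04000000   # ImageBase from the headers
ACTUAL_BASE = 0x05000000      # where the loader actually placed the image

def apply_relocations(image: bytearray, fixup_rvas, delta: int) -> None:
    """Add delta to every 32-bit absolute address listed in fixup_rvas."""
    for rva in fixup_rvas:
        (value,) = struct.unpack_from("<I", image, rva)
        struct.pack_into("<I", image, rva, (value + delta) & 0xFFFFFFFF)

# Toy image: `mov eax, [0x040000F4]` at RVA 0x10 (opcode A1 + 4-byte absolute address).
image = bytearray(0x100)
image[0x10] = 0xA1
struct.pack_into("<I", image, 0x11, PREFERRED_BASE + 0xF4)

fixups = [0x11]                        # RVA of the address operand (hand-written here)
delta = ACTUAL_BASE - PREFERRED_BASE   # 0x01000000
apply_relocations(image, fixups, delta)

(patched,) = struct.unpack_from("<I", image, 0x11)
print(hex(delta), hex(patched))        # prints 0x1000000 0x50000f4
```

Only the operand bytes at the recorded locations change; the opcode and everything else in the image stay as the linker wrote them.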

Sometimes a module has no relocations (they were stripped or never generated), so it can only be loaded at its preferred base address. See the related question "How do I determine if an EXE (or DLL) participate in ASLR, i.e. is relocatable?"
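One way to check this yourself is to read two flags from the PE headers: IMAGE_FILE_RELOCS_STRIPPED (0x0001) in the COFF Characteristics field and IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE (0x0040) in the optional header's DllCharacteristics field. The sketch below assumes a well-formed file and does no bounds checking; the function name `relocation_info` is made up for the example.

```python
import struct

IMAGE_FILE_RELOCS_STRIPPED = 0x0001
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040

def relocation_info(data: bytes) -> dict:
    """Report the relocation-related header flags of a PE image given as raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)    # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # COFF file header follows the 4-byte signature; Characteristics is at offset 18 in it.
    (characteristics,) = struct.unpack_from("<H", data, pe_offset + 4 + 18)
    # Optional header follows the 20-byte COFF header; DllCharacteristics is at
    # offset 70 in both the PE32 and PE32+ layouts.
    (dll_characteristics,) = struct.unpack_from("<H", data, pe_offset + 24 + 70)
    return {
        "relocs_stripped": bool(characteristics & IMAGE_FILE_RELOCS_STRIPPED),
        "dynamic_base": bool(dll_characteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE),
    }
```

It could be used as `relocation_info(open(path, "rb").read())`, where `path` is whatever exe or dll you want to inspect.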

For more information on ASLR: https://insights.sei.cmu.edu/cert/2014/02/differences-between-aslr-on-windows-and-linux.html

There is another way to move the executable in memory and still have it run correctly: Position Independent Code (PIC), which is code crafted in such a way that it runs anywhere in memory without the loader having to perform base relocations. This is very common in Linux shared libraries, and it is done by addressing data relatively ("access this data item at this distance from the instruction pointer").

To do this, the x64 architecture offers RIP-relative addressing. On x86 a trick is used to emulate it: get the content of the instruction pointer and then calculate the VA of a variable by adding a constant offset to it. This is very well explained here: https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html
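The arithmetic behind instruction-pointer-relative data access can be sketched as follows. The RVAs are invented for the example and `pic_data_address` is a hypothetical helper. The point is only that the displacement is fixed at link time and still yields the right address for any load base.

```python
def pic_data_address(load_base: int, next_instruction_rva: int, displacement: int) -> int:
    """Address reached by an instruction-pointer-relative access."""
    instruction_pointer = load_base + next_instruction_rva  # IP after the accessing instruction
    return instruction_pointer + displacement

NEXT_INSTRUCTION_RVA = 0x1007   # assumed RVA of the instruction following the access
VARIABLE_RVA = 0x3000           # assumed RVA of the variable
DISPLACEMENT = VARIABLE_RVA - NEXT_INSTRUCTION_RVA  # constant encoded in the instruction

for base in (0x04000000, 0x05000000):
    address = pic_data_address(base, NEXT_INSTRUCTION_RVA, DISPLACEMENT)
    assert address == base + VARIABLE_RVA  # correct for every base, no patching needed
```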

I don't think PIC is common in Windows. More often than not, Windows modules contain base relocations so that absolute addresses can be fixed up when the module is loaded somewhere other than its preferred base address. I'm not entirely sure about this last paragraph, though, so take it with a grain of salt.

More info:

http://opensecuritytraining.info/LifeOfBinaries.html

How are windows DLL actually shared? (a bit confusing because I didn't explain myself well when asking the question).

https://www.iecc.com/linker/

I hope I've helped :)
