5

I need to get the start and end addresses of the following process segments: code, data, stack, and environment. I understand how they are laid out in memory, but I don't know how to get the addresses using API calls or some other means. I have found how to get the start of some segments using this code:

#include <stdio.h>

int temp_data = 100;
static int temp_bss;

void print_addr ( void )
{
        int local_var = 100;
        /* %p expects a void *, so each address is converted explicitly */
        void *code_segment_address = ( void* ) &print_addr;
        void *data_segment_address = &temp_data;
        void *bss_address = &temp_bss;
        void *stack_segment_address = &local_var;

        printf ( "\nAddress of various segments:" );
        printf ( "\n\tCode Segment : %p" , code_segment_address );
        printf ( "\n\tData Segment : %p" , data_segment_address );
        printf ( "\n\tBSS : %p" , bss_address );
        printf ( "\n\tStack Segment : %p\n" , stack_segment_address );

}

int main ( )
{
        print_addr ();
        return 0;
}

But I don't know how to find the end of each segment. My only idea is that the end of one segment is the start of the next. Please explain how I can do this using C and the Linux API.

pts
  • 1
    You want this from a process in memory? Or from an ELF executable on disk? Either way, the information is in the ELF headers. You should look at the ELF spec. – Jonathon Reinhart Oct 18 '14 at 21:59
  • Why do you ask? Is your application multi-threaded? – Basile Starynkevitch Oct 18 '14 at 22:03
  • You really should explain your motivation, and edit your question to explain why you need all that! – Basile Starynkevitch Oct 18 '14 at 22:11
  • 1
    I'm curious; why do you *need* to get these things? – Oliver Charlesworth Oct 18 '14 at 22:39
  • 1
    Your code to compute the beginning of the code segment is incorrect. It's hard to predict the order in which your code ends up in the executable but the linker will usually insert a PLT table before the first functions so you will never get the beginning of the code segment just by taking the address of a function. – fuz Oct 18 '14 at 22:43
  • 1
    The same also applies to your other code. Your `print_addr()` function will not be the first stack frame and there will be other variables in the data section before `temp_data`. Please consider consulting the output of `objdump` for what your executable will look like. – fuz Oct 18 '14 at 22:44
  • -1, because you did not edit your question to give your motivations. **Why do you need this?** – Basile Starynkevitch Oct 19 '14 at 05:44

4 Answers

7

I'm not sure that the data or the heap segment is well defined and unique (in particular in multi-threaded applications, or simply in applications using dynamic libraries, including libc.so). In other words, there is no longer a single well-defined start and end of the text, data, or heap segment, because today a process has many such segments. So your question doesn't even make sense in the general case.

Most malloc implementations use mmap(2) and munmap(2) much more than sbrk(2).

You should read more about proc(5). In particular, your application could read /proc/self/maps (or /proc/1234/maps for process of pid 1234) or /proc/self/smaps; try cat /proc/self/maps and consider using fopen(3) on "/proc/self/maps" (then a loop on fgets or readline, and finally and quickly fclose). Perhaps dladdr(3) might be relevant.

You could also read the ELF headers of your program, e.g. of /proc/self/exe. See also readelf(1) and objdump(1) & execve(2) & elf(5) & ld.so(8) & libelf. Read also Levine's Linkers & Loaders book and Drepper's paper: How To Write Shared Libraries.

See also this answer to a related question (and also that question). Notice that recent Linux systems have ASLR, so the address layout of two similar processes running the same program in the same environment would be different.

Try also running strace(1) on some simple command or on your program. You'll understand the relevant syscalls(2) a bit better. Read also Advanced Linux Programming.

Basile Starynkevitch
  • Thanks for the answer. Can you give an example of how to get an address using proc? –  Oct 18 '14 at 22:06
  • What do you mean by "how to get an address"? – Basile Starynkevitch Oct 18 '14 at 22:08
  • I mean how to get the start and end address of, for example, the code segment. –  Oct 18 '14 at 22:09
  • 3
    There is no more a single code segment. Run `cat /proc/self/maps` in a terminal and try to understand its output. – Basile Starynkevitch Oct 18 '14 at 22:10
  • I have just done it. As I understand it, the output shows the heap, dynamic libraries, stack, and system calls. But I don't understand what the first three lines mean. –  Oct 18 '14 at 22:13
  • Segments corresponding to the ELF executable e.g. `/bin/cat` – Basile Starynkevitch Oct 18 '14 at 22:14
  • 1
    Oh, I am so disappointed about this; I am a newbie. Can you please provide an example of getting any segment's addresses from these files? Thanks in advance. –  Oct 18 '14 at 22:27
  • 4
    @ketazafor There is no such thing as a single text and data segment on a modern Linux system anymore. Text is spread over multiple shared libraries and data is scattered around the address space from multiple `mmap()` calls. What you want simply does not exist anymore. – fuz Oct 18 '14 at 22:35
  • 2
    @ketazafor: All the relevant information to start from is in this answer. The details are for you to discover and understand. – pts Oct 18 '14 at 22:42
4

See man 3 end for some help:

#include <stdio.h>

/* Defined by the linker; only their addresses are meaningful (see man 3 end). */
extern char etext, edata, end;

int
main(int ac, char **av, char **env)
{
        printf("main %p\n", (void *) main);
        printf("etext %p\n", (void *) &etext);
        printf("edata %p\n", (void *) &edata);
        printf("end %p\n", (void *) &end);
        return 0;
}

The addresses of those 3 symbols are the first address after the end of the text, initialized data, and uninitialized data segments.

You can get at the environment variables via a 3rd parameter to main() as in the example code above, but you can also walk up the stack starting with the address of the first argument pointer (&av[0] in the code above). There's a NULL value word (32 or 64 bit depending on CPU) after the last pointer to a command line argument string. After that NULL lies the array of environment pointers.

The top of the stack is near impossible to get programmatically - modern OSes all do "Address Space Layout Randomization" (ASLR) to provide some mitigation of buffer overflows. The "end" of the stack is hazy, as you can allocate on the stack (via recursion or alloca()) until you run into the top of the heap. So the "end" of the stack depends on allocation patterns of the program in question.

You should also be aware of the ELF auxiliary vector. See man getauxval for a C language interface, and this article for some explanation. User programs rarely have a use for the ELF auxiliary vector, but it's intimately tied up with dynamic linking.

  • This is not very relevant for current Linux systems and dynamically linked executables and multi-threaded processes. It was more relevant in the previous century. – Basile Starynkevitch Oct 19 '14 at 05:28
  • Notice that the portable way to access the environment is through the global variable `environ`. – fuz Oct 19 '14 at 10:21
1

As said in another comment, the notion of a text, data, and stack segment does not really exist on Linux today. Program text is spread over shared libraries and memory allocation is done with mmap() instead of brk() causing the allocated data to be spread out all over the address space of a program.

That said, you can call sbrk(0) (documented in the brk(2) man page) to find the current end of the data segment, also called the program break, and you can use the symbols etext, edata, and end to find the boundaries of the executable's own segments. The beginning of the text segment is traditionally fixed (also called the “loading address”) and depends on the architecture and linker configuration, although position-independent executables are loaded at a randomized address. Notice that your program will most likely execute code outside the text section of your binary and will most likely not allocate any dynamic memory with brk.

See the corresponding man pages for more details.

fuz
0

Current flavours of Windows and Linux use flat address spaces, meaning that code and data segments are the same and pretty much always go from 0 to 2^32-1 (for 32-bit systems) and 2^64-1 (for 64-bit systems). Different processes usually have completely different address spaces, apart from shared memory. Usually only some parts of an address space have any memory mapped to them, and some parts may not even be addressable due to hardware limitations.

The linker's code and data segments become the sections of the runnable image, and the ELF format common under Linux adds some further complications to that. Access is highly OS-specific and thus not really a C++ issue.

Under Windows you can get a pointer to the start of the loaded image via GetModuleHandle(0). By walking the executable headers you can find the COFF section table, which allows you to reverse-map all addresses that are part of the mapped executable image to their respective sections. Classifying other addresses is more difficult; they may belong to other mapped run images (loaded DLLs) or they may belong to address ranges that were allocated in another way, i.e. directly via VirtualAlloc() or indirectly in some way (HeapAlloc(), memory-mapped files, whatever).

If you only want to print nice stack traces or whatever then there are plenty of ready-made libraries that can do that for you. If you want to do checksums then things get a lot more complicated; better to use code-signing or ready-made libraries. The real answer to your question depends on what your real problem actually is...

DarthGizka