4

With the standard C++ entry:

int main(int argc, char *argv[])
{
    // stuff
}

How is argv populated? The compiler has no idea what size to allocate for the array, and I would assume the OS is the entity responsible for passing the additional arguments to the program, but how are they passed to main? Where is the array of pointers initialized? Is the code that does this generated by the compiler and then injected into the program's launch sequence?

This is something I've always taken for granted. A problem I was working on today made me realize I'm not really sure how the additional arguments are eventually received by main, let alone how they are handed to any other program, such as CPython, where they show up as sys.argv.

Bonus: How does the OS handle command line arguments? Clearly the CLI (or shell) knows how to parse the string sequence, but how are the additional arguments "fed into" the executable? Does the compiler add some functionality to just read from stdin (which is a buffer) and parse the parameters accordingly before passing them to main?

  • 1
    Related: [How do command line arguments work?](https://stackoverflow.com/q/9376035/11082165) – Brian61354270 Dec 04 '21 at 02:49
  • When I implemented this functionality for a toy OS in a college project, it was just a matter of `malloc()`ing a big enough chunk of memory to hold both the pointer-array and the strings it needed to point to, then copying the string-data over and setting the pointers to point into it. I'm sure real OSes do something more elaborate, but it isn't rocket science. – Jeremy Friesner Dec 04 '21 at 03:02
  • @JeremyFriesner and that may answer how the OS parses the command line and stores them, but it doesn't answer how the OS then injects them into a binary executable to pass into `main`. My guess would be that the OS runs the executable, which outputs the program into memory, then it has access to `main` to pass it the array. –  Dec 04 '21 at 03:04
  • @madeslurpy the OS doesn't modify the executable; but it does set up the memory-space where the executable will run. Allocating and populating the argv-array within that memory space is part of that setup, which is done before `main()` is called. – Jeremy Friesner Dec 04 '21 at 03:05
  • Btw don't let the `char *[]` syntax fool you, it's logically equivalent to `char **`, i.e. the argument is just a pointer, and therefore the size of the array it points to does not have to be known at compile-time. – Jeremy Friesner Dec 04 '21 at 03:08
  • @JeremyFriesner Fair, which explains why it's passed into main and why `argc` is there. I'm still very curious about how it even gets to `main` though. –  Dec 04 '21 at 03:12
  • Matt Godbolt gives a really interesting presentation on the subject of process setup in Linux, here: https://www.youtube.com/watch?v=dOfucXtyEsU – Jeremy Friesner Dec 04 '21 at 03:12
  • @JeremyFriesner If you removed the "Linux" portion of it, when somebody was writing a compiler specific to Windows, they must've known how `argv` was going to be passed in such that the grammar of `main` was correct in the implementation. Right? –  Dec 04 '21 at 03:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239820/discussion-between-jeremy-friesner-and-madeslurpy). – Jeremy Friesner Dec 04 '21 at 03:21

3 Answers

4

Let's take Linux x86-64 as an example.

When a process calls execv("/my/prog", args), it makes a system call to the kernel. The kernel uses the args pointer to locate the argument strings in the process's memory, copies them somewhere else for temporary safekeeping, and then tears down the process's virtual memory. Then it sets up the virtual memory for the new program, and loads its code and data from its binary /my/prog (actually it just maps it for demand loading, but that's not important).

It also allocates a block of memory to be the new program's stack, and that's where it copies the command line arguments, as well as the environment variables and various other data that needs to be passed to the new program. Here it also sets up the array of argv pointers, pointing to the strings themselves in the program's stack memory, and pushes the argument count on the stack as well. The precise layout is specified in the ABI, see Figure 3.9.
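To make that layout concrete, here is a user-space sketch of the same idea: copy the strings into one buffer, then build a null-terminated array of pointers into that buffer. `ArgBlock` is a hypothetical name invented for this illustration. The real kernel writes straight onto the new process's stack in the order the ABI specifies, rather than using two separate vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not a kernel or libc API): owns the string bytes
// and a null-terminated array of pointers into them, like argv.
class ArgBlock {
public:
    explicit ArgBlock(const std::vector<std::string>& args) {
        // Size the string storage once so the pointers taken below stay valid.
        std::size_t total = 0;
        for (const auto& a : args) total += a.size() + 1;  // +1 for each NUL
        strings_.resize(total);

        pointers_.reserve(args.size() + 1);
        char* p = strings_.data();
        for (const auto& a : args) {
            std::memcpy(p, a.c_str(), a.size() + 1);  // copy text and its NUL
            pointers_.push_back(p);                   // argv[i] points into the buffer
            p += a.size() + 1;
        }
        pointers_.push_back(nullptr);                 // argv[argc] is a null pointer
    }

    // Copying would leave the pointers aimed at the original's buffer.
    ArgBlock(const ArgBlock&) = delete;
    ArgBlock& operator=(const ArgBlock&) = delete;

    int argc() const { return static_cast<int>(pointers_.size()) - 1; }
    char** argv() { return pointers_.data(); }

private:
    std::vector<char> strings_;    // "arg0\0arg1\0..." back to back
    std::vector<char*> pointers_;  // one pointer per argument, then nullptr
};
```

Because the strings sit back to back, `argv[1]` ends up exactly `strlen(argv[0]) + 1` bytes after `argv[0]`, which is the same packing the kernel code relies on when it walks the strings.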

Now to actually start the program. The binary's header specifies an address to be used as an entry point. The linker will have arranged that this points to a special piece of startup code. This code usually comes with your standard C library, in an object file with a name like crt0.o. It has been written in assembly, and its job is to process the command line arguments and so forth, set up registers and memory the way that compiled C or C++ code expects, and call a C/C++ function in the standard library which will do further initialization and then call your main. The kernel jumps to the entry point address, switching to unprivileged mode along the way, and the startup code starts executing.

You can see glibc's version in start.S, but a very minimal version could look something like this.

; main takes argc in rdi and argv in rsi

; top of stack (lowest address, where rsp points) holds the argument count
mov rdi, [rsp]

; next is start of the argument pointer array
lea rsi, [rsp+8]

call main

; main returns, exit the program
mov rdi, rax
call exit
; exit() makes an exit system call and doesn't return

So when control actually reaches your main function, the registers contain the same values as if it had been called by another C++ function. The argv argument points to an array of pointers on the stack, each of which points to a string located further up in stack memory, as set up by the kernel.
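Seen from the C++ side, this is also why the compiler never needs to know the array's size. The sketch below shows two things: the `char *argv[]` parameter is adjusted to `char **`, and the array can be walked without `argc` because the standard requires `argv[argc]` to be a null pointer. `count_args` is a name made up for this example.

```cpp
#include <cassert>
#include <type_traits>

// An array-typed parameter is adjusted to a pointer, so both spellings of
// main's signature name the same function type.
static_assert(std::is_same<int(int, char*[]), int(int, char**)>::value,
              "char*[] as a parameter is really char**");

// Hypothetical helper: recovers the argument count by walking the pointer
// array until the terminating null pointer.
int count_args(char* const* argv) {
    int n = 0;
    while (argv[n] != nullptr) {
        ++n;
    }
    return n;
}
```

Inside `main`, `count_args(argv)` should equal `argc` on a conforming implementation.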

Nate Eldredge
1

For a Linux ELF process:

It begins in the Linux kernel, in create_elf_tables:

/* Now, let's put argc (and argv, envp if appropriate) on the stack */
if (put_user(argc, sp++))
    return -EFAULT;

/* Populate list of argv pointers back to argv strings. */
p = mm->arg_end = mm->arg_start;
while (argc-- > 0) {
    size_t len;
    if (put_user((elf_addr_t)p, sp++))
        return -EFAULT;
    len = strnlen_user((void __user *)p, MAX_ARG_STRLEN);
    if (!len || len > MAX_ARG_STRLEN)
        return -EINVAL;
    p += len;
}
if (put_user(0, sp++))
    return -EFAULT;
mm->arg_end = p;

/* Populate list of envp pointers back to envp strings. */
mm->env_end = mm->env_start = p;
while (envc-- > 0) {
    size_t len;
    if (put_user((elf_addr_t)p, sp++))
        return -EFAULT;
    len = strnlen_user((void __user *)p, MAX_ARG_STRLEN);
    if (!len || len > MAX_ARG_STRLEN)
        return -EINVAL;
    p += len;
}
if (put_user(0, sp++))
    return -EFAULT;
mm->env_end = p;

This layout is then consumed by glibc, in sysdeps/x86_64/start.S:

%rsp         The stack contains the arguments and environment:
             0(%rsp)                         argc
             LP_SIZE(%rsp)                   argv[0]
             ...
             (LP_SIZE*argc)(%rsp)            NULL
             (LP_SIZE*(argc+1))(%rsp)        envp[0]
             ...
                                             NULL
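In C++ terms, the layout in that comment means the environment pointers start one slot past argv's terminating NULL. The sketch below assumes exactly this Linux initial-stack layout and is not portable. `envp_after_argv` is a hypothetical name.

```cpp
#include <cassert>

// Hypothetical helper: given argc and argv as laid out on the initial Linux
// process stack, returns the address of envp[0], which sits just past
// argv's null terminator. This assumes the layout shown above.
char** envp_after_argv(int argc, char** argv) {
    assert(argv[argc] == nullptr);  // the slot at LP_SIZE*argc past argv[0]
    return argv + argc + 1;         // the next slot holds envp[0]
}
```

On Linux with glibc, the result should match the global `environ` at program start, but that is a property of this platform and not something the C++ standard promises.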
-1

How is argv populated?

The language implementation (by which I mean everything beyond your own source code, such as the shell, the operating system, etc.) takes care of it.

I would assume the OS is the entity responsible for passing the additional arguments to the program

Pretty much, yes.

Where is the array of pointers initialized?

Somewhere that the language implementation chose to initialise them.

Bonus: How does the OS handle command line arguments?

There is no one OS. There are many, and each does its own thing. Some of them are open source, so you will be able to study them.

eerorika
  • 1
    Assuming the following: language implementation is C++ (whatever the standard one is) and OS is Windows. An answer in the form of "they all do it in their own way, somehow" is extremely vague, even if valid, and a working example of a standard event sequence in any combination would be enlightening. Otherwise it's like saying "electricity flows just because of how its medium works". –  Dec 04 '21 at 02:55
  • @madeslurpy I don't know how windows works and I cannot find out since I don't have the source code. `is extremely vague` It corresponds with the vagueness of the question. – eerorika Dec 04 '21 at 02:57
  • Unfortunately, you're out of luck. MS Windows is a commercial operating system with most of its source code kept under lock and key, as proprietary information. There's a slight chance that the documentation for this is buried somewhere within MSDN, maybe in parts of it that cost real $$ to purchase access to. Or, Microsoft may not document it at all. Since something like this is not necessary to write working software for MS Windows, there's not much of a reason for Microsoft to publish any public documentation for it. – Sam Varshavchik Dec 04 '21 at 02:58
  • Well, you may not know Windows, but surely you know some combination of how they work together. And if so, why not share that? To be less pedantic, let's go with Unix as the OS? –  Dec 04 '21 at 02:58
  • 1
    @SamVarshavchik And if that's the case, so be it. To wit, I am not asking for the actual source implementation, but the general theory behind its implementation. An executable is a closed binary file, so how do command line arguments get "into" the executable for use in `main`? –  Dec 04 '21 at 03:00
  • 1
    @madeslurpy most of [Nate's answer](https://stackoverflow.com/a/70223489/65863) is similar on Windows, too. Key differences are that when Windows creates a new process, the command line is copied into the process' [`PEB`](https://docs.microsoft.com/en-us/windows/win32/api/winternl/ns-winternl-peb) header, and then the program's runtime library's **startup code** retrieves the command line via [`GetCommandLineA()`](https://docs.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinea) and splits it up into the `char*[]` array that is passed to the `main()` function. – Remy Lebeau Dec 04 '21 at 08:44
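As a rough illustration of the splitting step described in the last comment, here is a simplified sketch of turning one command-line string into separate arguments. It assumes only spaces, tabs, and double quotes matter. The actual Microsoft C runtime rules also give backslashes before quotes special meaning, which this sketch ignores. `split_command_line` is a made-up name, not a Windows or CRT function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified, hypothetical splitter: whitespace separates arguments and
// double quotes group text (including spaces) into a single argument.
std::vector<std::string> split_command_line(const std::string& line) {
    std::vector<std::string> args;
    std::string current;
    bool in_quotes = false;
    bool have_token = false;  // lets "" produce an empty argument

    for (char c : line) {
        if (c == '"') {
            in_quotes = !in_quotes;  // quotes toggle grouping and are dropped
            have_token = true;
        } else if ((c == ' ' || c == '\t') && !in_quotes) {
            if (have_token) {        // unquoted whitespace ends the argument
                args.push_back(current);
                current.clear();
                have_token = false;
            }
        } else {
            current += c;
            have_token = true;
        }
    }
    if (have_token) {
        args.push_back(current);     // flush the final argument
    }
    return args;
}
```

Startup code along these lines would then build the pointer array from the resulting strings before calling `main`.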