
Let's say I write a program that contains many functions/methods, and some of them are called far more often than others. Does the position of a function/method within the program affect its speed at a lower level (memory)?

I am currently learning Computer Organization & Architecture, which is how this question came to mind.

Peter Cordes

1 Answer


RAM itself is "flat": performance is equal at any address (except for NUMA local vs. remote memory in a multi-socket machine, or mixed-size DIMMs on a single socket leading to only partial dual-channel benefits; see footnote 1).

i-cache and iTLB locality can make a difference, so grouping "hot" functions together can be useful even if you don't just inline them.
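As a rough illustration of grouping hot code, here is a minimal sketch using GCC's `hot` and `cold` function attributes. GCC documents that, on many targets, functions marked this way are placed in separate subsections of the text section so that hot functions end up near each other. The function names `sum_array` and `report_error` are hypothetical, and whether the layout actually changes depends on the compiler version, target, and optimization flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error path, expected to run rarely.
   `cold` asks GCC to optimize it for size and group it with other
   cold functions; `noinline` keeps it out of the hot function's body. */
__attribute__((cold, noinline))
static long report_error(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "error: %s\n", msg);
    return -1;
}

/* Hypothetical hot function, expected to dominate run time.
   `hot` asks GCC to optimize it more aggressively and group it with
   other hot functions. */
__attribute__((hot))
long sum_array(const int *data, size_t n)
{
    if (data == NULL)
        return report_error("null input");   /* rare path stays out of line */

    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}
```

Inspecting the object file's section headers or symbol addresses (for example with `objdump -h` or `nm`) is the way to confirm where each function landed; the attributes are hints, not guarantees.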

Locality also matters for demand paging of code in from disk: If a whole block of your executable is "cold", e.g. only needed for error handling, program startup doesn't have to wait for it to get page-faulted in from disk (or even soft page faults if it was hot in the OS's pagecache). Similarly, grouping "startup" code into a page can allow the OS to drop that "clean" page later when it's no longer needed, freeing up physical memory for more caching.

Compilers like GCC do this, putting CRT startup code like `_start` (which eventually calls `main`) into a `.init` section in the same program segment (mapped by the program loader) as `.text` and `.fini`, just to group startup code together. Any C++ non-const static-initializer functions would also go in that section.
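To make the "startup code" idea concrete in C, here is a minimal sketch using GCC's `constructor` function attribute, which runs a function once before `main`, much like a C++ static initializer. The names `init_squares` and `square_of` are hypothetical. Which section the constructor ends up in is an assumption to verify on your own toolchain: GCC's release notes describe a `.text.startup` subsection for code that only runs at startup, but the result depends on compiler version and flags.

```c
#include <assert.h>

static int squares[16];   /* hypothetical lookup table filled at startup */

/* Runs once, before main(). After that it is never called again, so
   the page holding it is a candidate for being dropped by the OS. */
__attribute__((constructor))
static void init_squares(void)
{
    for (int i = 0; i < 16; i++)
        squares[i] = i * i;
}

/* Steady-state code that only reads the table. */
int square_of(unsigned i)
{
    assert(i < 16);   /* the table has only 16 entries */
    return squares[i];
}
```

Because `init_squares` is startup-only and `square_of` is not, a layout that keeps them in different pages lets the startup page go unused for the rest of the run.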


Footnote 1: Usually; IIRC it's possible for a computer with one 4GB and one 8GB stick of memory to run dual channel for the first 8GB of physical address space, but only single channel for the last 4GB, so half the memory bandwidth there. I think some real-life Intel chipset / CPU memory controllers are like that.

But unless you were making an embedded system, you don't choose where in physical memory the OS loads your program. It's also much more normal for computers to use matched memory on multi-channel memory controllers so the whole range of memory can be interleaved between channels.

BTW, locality matters for DRAM itself: it's laid out in a row/column setup, and switching rows takes an extra DDR controller command vs. just reading another column in the same open "page". DRAM pages aren't the same thing as virtual-memory pages; a DRAM page is memory in the same row on the same channel, and is often 2 KiB. See What Every Programmer Should Know About Memory? for more details than you'll probably ever want about DDR DRAM, and some really good stuff about cache and memory layout.

Peter Cordes
  • Addressing the issue from the opposite direction, there have also been attempts (at least in research compilers) to utilize *outlining* of "cold" code, and grouping this outlined code into separate pages as part of layout optimizations. To my quite limited knowledge this helped significantly with code size reductions but only a little (single digit percentages) for performance optimization. – njuffa Jan 31 '21 at 22:51
  • @njuffa: via profile-guided optimization? Yeah that would make sense. I'd hope that `gcc -fprofile-use` and similar things for other compilers could put cold functions in a `".text.cold"` section or something like that. It already identifies cold vs. hot *loops* to avoid bloating the code vectorizing or unrolling cold loops. – Peter Cordes Jan 31 '21 at 23:48
  • I don't recall, as I read about this about twenty years ago. I would *assume* they used a mixture of profiling, heuristics, and explicit programmer annotations (e.g. attributes for branch likelihood) to determine which code was "cold". – njuffa Jan 31 '21 at 23:57
  • Yes, this mixed channel thing is called [flex mode](https://www.intel.ca/content/www/ca/en/support/articles/000005657/boards-and-kits.html#flex) and I think most Intel memory controllers support it. A big reason that RAM would have different apparent performance would be NUMA, although of course the RAM itself may perform identically: it's just that some DRAM appears slower to the remote socket, etc. – BeeOnRope Feb 01 '21 at 08:01
  • @njuffa - yeah you can definitely find outlining in non-research compilers too. – BeeOnRope Feb 01 '21 at 08:02