
I know that a function's data can be kept near the function (for example, at its end) or away from it, in the data section. I also know it's generally better to keep data such as jump tables in the data section, so let's assume we are keeping each function's data there. My question is about how to arrange that data (by size) within the data section. For example, suppose a function has a jump table (a list of 8-byte addresses), a lot of DWORD (4-byte) data, some WORD (2-byte) data, and a lot of 1-byte data:

section code
func:
   ...
section data align 64
 func.jmp_table:
    DQ ...
    DQ ...
    DQ ...
    DQ ...
    DQ ...
    DQ ...
 func.data4:
    DD 0,1,2,3,4,...
 func.data2:
    DW 0,1,2,3,4,...
 func.data1:
    DB 0,1,2,3,4,...

So we put a function's data in the data section. But now suppose we have 10 functions, each with its own mixed-size data (QWORD, DWORD, WORD, BYTE, ...). My question is: which layout is better? Putting each function's data next to each other (QWORD, DWORD, WORD, BYTE in one group per function), or dividing the data section into QWORDs, DWORDs, WORDs, and BYTEs and grouping the data by size?

Way 1 (put each function's data back to back, with 8-byte alignment at the top of each group):

section code
 func:
   ...
 func2:
   ...
 func3:
   ...
section data align 64
 func.jmp_table:
    DQ ...,...,...,...,...,...
 func.data4:
    DD ...,...,...,...,...,...
 func.data2:
    DW ...,...,...,...,...,...
 func.data1:
    DB ...,...,...,...,...,...

 align 8
 func2.jmp_table:
    DQ ...,...,...,...,...,...
 func2.data4:
    DD ...,...,...,...,...,...
 func2.data2:
    DW ...,...,...,...,...,...
 func2.data1:
    DB ...,...,...,...,...,...

 align 8
 func3.jmp_table:
    DQ ...,...,...,...,...,...
 func3.data4:
    DD ...,...,...,...,...,...
 func3.data2:
    DW ...,...,...,...,...,...
 func3.data1:
    DB ...,...,...,...,...,...

Way 2 (split each function's data by size and arrange the data section size by size):

section code
 func:
   ...
 func2:
   ...
 func3:
   ...
section data align 64
 func.jmp_table:
    DQ ...,...,...,...,...,...
 func2.jmp_table:
    DQ ...,...,...,...,...,...
 func3.jmp_table:
    DQ ...,...,...,...,...,...

 func.data4:
    DD ...,...,...,...,...,...
 func2.data4:
    DD ...,...,...,...,...,...
 func3.data4:
    DD ...,...,...,...,...,...

 func.data2:
    DW ...,...,...,...,...,...
 func2.data2:
    DW ...,...,...,...,...,...
 func3.data2:
    DW ...,...,...,...,...,...

 func.data1:
    DB ...,...,...,...,...,...
 func2.data1:
    DB ...,...,...,...,...,...
 func3.data1:
    DB ...,...,...,...,...,...
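One property of Way 2 worth spelling out: because each size class is contiguous and ordered from largest to smallest element, a single `align` at the start of the section is enough, and each group leaves the alignment that the next (smaller) group needs. A minimal NASM-style sketch (labels and values are illustrative, not from the original):

```nasm
section .data
align 8                    ; one alignment covers all 8-byte entries
f1.jmp_table: dq 0, 0      ; dq groups keep 8-byte alignment
f2.jmp_table: dq 0, 0
f1.data4:     dd 1, 2, 3   ; 12 bytes -> still 4-byte aligned afterwards
f2.data4:     dd 4, 5      ; any run of dd preserves 4-byte alignment
f1.data2:     dw 1, 2      ; dd groups always end 2-aligned, fine for dw
f1.data1:     db 1, 2, 3   ; bytes need no alignment at all
```

No padding is ever emitted between groups, because each group's element size divides the previous group's.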
HelloGUI
  • Putting data related to a function close together means you either give up alignment or may need more padding inserted by the `align` directives. You can align things to their size, e.g. `align 4` before `dd ...` and `align 2` before `dw ...`. If you put all the `dw ...` items one after another, only a single `align 2` can possibly emit any padding; additional `align 2` between `dw` directives will not emit anything, because the directives all preserve 2-byte alignment. Either approach is valid. Grouping a function's data allows placing the section contents directly with the function. – ecm May 23 '23 at 07:19
  • @ecm If grouped together, we don't give up alignment, since we order from largest to smallest and align once at the top. But what about cache lines? When a function's data are near each other, isn't that better for cache-line utilization? – HelloGUI May 23 '23 at 07:31
  • CPUs use split L1 caches, so mixing code and data means you're wasting some L1i capacity on data and some L1d capacity on code. Like if you had more than 32 KiB total code + read-only data, you'd have misses in both caches. Also you're using up iTLB vs. dTLB capacity. There's a reason x86 compilers don't do that, because the disadvantages outweigh the benefits for real functions that are part of real programs that involve more than 1 cache-line of code+data anyway. – Peter Cordes May 23 '23 at 09:03
  • It's short-sighted to optimize the behaviour of the cold-caches case for each function separately, at the expense of overall cache footprint. See also [Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code?](https://stackoverflow.com/q/55607052) (they don't, that was a misunderstanding by the asker of that question. Obfuscation tools might do that.) – Peter Cordes May 23 '23 at 09:04
  • @PeterCordes Yes, but my question is about whether to arrange data in the data section by size (splitting each function's data into `QWORDs, DWORDs, WORDs, BYTEs`) or not ... and based on your last comment I assumed you agree with my first way. Right? – HelloGUI May 23 '23 at 10:18
  • Oh, I was totally misreading the question after skimming it too quickly (sorry Mike Nakis, your answer was correct for the actual question). Yes, you should generally keep one function's data together. The total amount of space lost to padding is usually very small. Of course, in real programs most static data is shared across functions, with functions using stack space for their locals that they only need while they're running. That's more efficient for code-size as well because the addressing mode is smaller to reach a stack variable than RIP+rel32 for static storage. – Peter Cordes May 23 '23 at 10:28
  • @PeterCordes I agreed with Mike, but not with his last part, which said "it's better to keep the code and data together". – HelloGUI May 23 '23 at 10:34
  • Oh, yeah that was in his original answer; that's probably why I thought that's what the whole question was about. (Sometimes I skip down to an answer if someone's posted one, in case someone else has already made sense of a longish question.) – Peter Cordes May 23 '23 at 10:58
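ecm's suggestion above — align each group to its element size — can be sketched in NASM syntax like this (label names and values are illustrative; `case0`..`case2` are assumed code labels):

```nasm
section .data
align 8                                  ; 8-byte entries
func.jmp_table: dq case0, case1, case2
align 4                                  ; no-op if offset is already 4-aligned
func.data4:     dd 10, 20, 30
align 2                                  ; emits at most one padding byte
func.data2:     dw 1, 2, 3
func.data1:     db 'a', 'b', 'c'         ; no alignment needed for bytes
```

Ordering each function's group from largest to smallest element, as the asker describes, makes the inner `align` directives no-ops, so only the `align` at the top of each function's group can ever emit padding.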

1 Answer


Your decision should be based upon cache utilization.

In other words, data that are often accessed together should be placed as close as possible, so as to maximize the chances of falling in the same "cache line".

("Cache line" is a historical term which you can look up, it basically means "cache page", but the word page is already used for something else.)
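To make that concrete: on current x86-64 CPUs a cache line is 64 bytes, so if one function's hot constants total under 64 bytes, a single `align 64` lets them share one line and one miss loads them all. A hedged NASM sketch (names and values are made up for illustration):

```nasm
section .data
align 64                             ; start on a cache-line boundary
; everything below fits in one 64-byte line
func.jmp_table: dq 0, 0, 0, 0        ; 32 bytes
func.data4:     dd 100, 200, 300     ; +12 = 44 bytes
func.data2:     dw 7, 8              ; +4  = 48 bytes
func.data1:     db 1, 2, 3, 4        ; +4  = 52 bytes, within one line
```

The same reasoning argues against interleaving rarely-used data of one function between the hot data of others: that spreads each function's working set across more lines.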

Mike Nakis
  • I was looking for this answer: `"In other words, data that are often accessed together should be placed as close as possible, so as to maximize the chances of falling in the same "cache line"."` But regarding putting instructions and data together: we have a `data-cache` and an `i-cache`, so it's really not necessary for these two to be near each other (both in the `text` section). Right? Or did I miss something? – HelloGUI May 23 '23 at 07:29
  • Mike and @HelloGUI: important to mention that CPUs use split L1 caches, so mixing code and data means you're wasting some L1i capacity on data and some L1d capacity on code. Like if you had more than 32 KiB total code + read-only data, you'd have misses in both caches. Also you're using up iTLB vs. dTLB capacity. There's a reason x86 compilers don't do that, because the disadvantages outweigh the benefits for real functions that are part of real programs that involve more than 1 cache-line of code+data anyway. – Peter Cordes May 23 '23 at 07:49
  • See also [Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code?](https://stackoverflow.com/q/55607052) (they actually don't, for good reason.) – Peter Cordes May 23 '23 at 07:49
  • *First of all, it does not really matter.* - That's not necessarily true! If you put some read/write data in the same cache line as a loop that stores to it, you'll have a self-modifying-code pipeline nuke for every store. Also, mixing read-write data would require you to have write+exec permission on your pages, bad for security. Separating out read-only data in its own page allows it to be mapped without exec permission, reducing the surface area for Spectre gadgets, which is why modern `ld` does that. – Peter Cordes May 23 '23 at 07:53
  • @HelloGUI nothing is necessary. All I am saying is that if the CPU ever needs to access the jump-tables, it is doing so as a result of specific instructions in the code that reference the jump-tables, so it makes sense to put the two as close as possible. – Mike Nakis May 23 '23 at 07:55
  • @PeterCordes I noticed you added the [x86-64] tag to the question. Are the meager few keywords and directives in the code shown unique enough and sufficient to inform such an inference? Also, don't you think that this changes the nature of the question from "assembly in general" to "x86-64 assembly in specific" ? – Mike Nakis May 23 '23 at 08:06
  • Mike, since I had a `jmp-table` and I used `DQ`, it shows that I am using x86-64, so Peter is right about that tag (x86-64). – HelloGUI May 23 '23 at 08:10
  • MikeNakis: Mostly based on the OP asking about x86-64 in their other question; I assumed that's what they intended to ask about. I'm aware that `dq` is probably valid ARMASM syntax, but that `section code` probably isn't correct for anything. (Ping @HelloGUI - `jmp-table` doesn't imply x86-64 at all, IDK why you think that. Any 64-bit ISA can have arrays of code pointers that they load from and jump to, that's a "jump table"). Stuffing read-only constants between functions is something compilers already do for ARM ("literal pools") because of the limited range of PC-relative loads. – Peter Cordes May 23 '23 at 08:11
  • Your edit removed any mention of only doing this for immutable data, which is a huge deal for x86-64 with its coherent I-cache and pipeline, thus machine clears on stores within the same cache line (or same page depending on microarchitecture) as any instruction in-flight. It's a non-issue on other ISAs, except that you do normally want your executable pages to be read-only. – Peter Cordes May 23 '23 at 08:28
  • Most modern CPUs across ISAs have split L1 caches. e.g. ARM started doing that over a decade ago, and are still doing that. And it's short-sighted to optimize the behaviour of the cold-caches case for each function separately, at the expense of *overall* cache footprint. Compiler behaviour (of separating code+data except on ARM) has larger programs in mind, where having most cache lines in both L1 caches would lead to more misses. – Peter Cordes May 23 '23 at 08:30
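The permission point raised in the comments maps directly onto section choice. In NASM targeting ELF, `.rodata` is the conventional read-only data section, so jump tables and constants can live in pages mapped without write or exec permission (labels here are illustrative):

```nasm
section .text            ; executable, read-only: code only
func:
    ; ... no writable data embedded between instructions ...
    ret

section .rodata          ; read-only, non-executable: jump tables, constants
align 8
func.jmp_table: dq 0, 0

section .data            ; read-write, non-executable
counter: dd 0            ; mutable data kept away from code pages
```

This keeps stores well away from any cache line holding in-flight instructions (avoiding the self-modifying-code machine clears mentioned above) and lets the linker map each section with the minimal permissions it needs.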