17

How does the compiler (MS Visual C++ 2010) combine identical string literals that appear in different cpp source files? For example, suppose the string literal "hello world\n" appears in both src1.cpp and src2.cpp. The compiled exe file will contain only one copy of that literal, probably in the constant/read-only section. Is this merging done by the linker?

What I hope to achieve is this: I have some modules written in assembly that are used by C++ modules, and these assembly modules contain many long string literal definitions. I know these string literals are identical to other string literals in the C++ source. If I link my assembler-generated obj code with the compiler-generated obj code, will the linker merge these string literals to remove the redundant copies, as it does when all modules are written in C++?

JavaMan
  • 1
    If your program is dependent on identical string literals occupying the same memory, you should redesign the program. – Mark Ransom Jun 08 '11 at 16:09
  • 5
    @MarkRansom: he never said his code is dependent on that; he's asking for what the expected behaviour is, which is entirely reasonable. – Paul Sonier Jun 08 '11 at 16:59
  • When in doubt, you can always perform the "merging" or "combining" yourself by defining string literal once and referring to it by name in other parts of the program. – Thomas Matthews Jun 08 '11 at 17:03
  • 1
    @Thomas Easier said than done. –  Jun 08 '11 at 17:04
  • @Paul Sonier, the expected behavior is that the compiler and linker can do whatever they want so long as valid code produces the expected output. It is a mistake to rely on unspecified behavior, even if it seems consistent, because compiler versions and option switches might change everything. I didn't say the question wasn't worth asking, just providing a warning. – Mark Ransom Jun 08 '11 at 17:08
  • 1
    @Neil, it's done all the time for the purposes of language translation. – Mark Ransom Jun 08 '11 at 17:10
  • @Mark Normally using things like Windows string resources, which are not the same kind of thing as C++ string literals. –  Jun 08 '11 at 17:15
  • But @Mark, the behavior in this case *is* specified, according to the `/GF` compiler switch. This question is asking what, if anything, needs to be done in an assembler file to ensure that "string literals" in the assembler file are compatible with string literals from other files so that Visual C++ can consider all of them for merging. – Rob Kennedy Jun 08 '11 at 17:20

6 Answers

11

(Note the following applies only to MSVC)

My first answer was misleading since I thought that the literal merging was magic done by the linker (and so that the /GF flag would only be needed by the linker).

However, that was a mistake. It turns out the linker has little special involvement in merging string literals - what happens is that when the /GF option is given to the compiler, it puts string literals in a "COMDAT" section of the object file with an object name that's based on the contents of the string literal. So the /GF flag is needed for the compile step, not for the link step.

When you use the /GF option, the compiler places each string literal in the object file in a separate section as a COMDAT object. The various COMDAT objects with the same name will be folded by the linker (I'm not exactly sure about the semantics of COMDAT, or what the linker might do if objects with the same name have different data). So a C file that contains

char* another_string = "this is a string";

will have something like the following in the object file:

SECTION HEADER #3
  .rdata name
       0 physical address
       0 virtual address
      11 size of raw data
     147 file pointer to raw data (00000147 to 00000157)
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
40301040 flags
         Initialized Data
         COMDAT; sym= "`string'" (??_C@_0BB@LFDAHJNG@this?5is?5a?5string?$AA@)
         4 byte align
         Read Only

RAW DATA #3
  00000000: 74 68 69 73 20 69 73 20 61 20 73 74 72 69 6E 67  this is a string
  00000010: 00      

with the relocation table wiring up the another_string variable name to the literal data.

Note that the name of the string literal object is clearly based on the contents of the literal string, but with some sort of mangling. The mangling scheme has been partially documented on Wikipedia (see "String constants").

Anyway, if you want literals in an assembly file to be treated in the same manner, you'd need to arrange for the literals to be placed in the object file in the same manner. I honestly don't know what (if any) mechanism the assembler might have for that. Placing an object in a "COMDAT" section is probably pretty easy - getting the name of the object to be based on the string contents (and mangled in the appropriate manner) is another story.

Unless there's some assembly directive/keyword that specifically supports this scenario, I think you might be out of luck. There certainly might be one, but I'm sufficiently rusty with ml.exe to have no idea, and a quick look at the skimpy MSDN docs for ml.exe didn't have anything jump out.

However, if you're willing to put the string literals in a C file and refer to them in your assembly code via externs, it should work. That's essentially what Mark Ransom advocates in his comments to the question.

j04n
Michael Burr
  • Also note that `/GF` puts each string into a COMDAT in a read-only section. If the linker has identical COMDAT folding (ICF) enabled, then duplicate strings will be consolidated. Note that ICF is based on the actual contents of the COMDAT as well as the attributes of the sections they're in. So, for example, if the assembly string is not in a read-only section, the linker won't consolidate it with ones generated by the compiler. Though the linker might use the name of the COMDAT as a first cut, the actual contents must be identical as well. – Adrian McCarthy Dec 27 '18 at 14:15
4

Yes, the process of merging the resources is done by the linker.

If your resources in your compiled assembly code are properly tagged as resources, the linker will be able to merge them with compiled C code.

Paul Sonier
  • The parsing portion of the compiler can merge resources within the same translation units; no need to wait for the linker. – Thomas Matthews Jun 08 '11 at 17:01
  • @Thomas, the parsing portion of the compiler is not involved in detecting literals defined in assembler code. – Rob Kennedy Jun 08 '11 at 17:21
  • It's the compiler, not the linker. I see only one occurrence within a single obj file. When all the object files are linked into a so, dll, or dylib, string literals from different object files are not merged. – martian Dec 21 '17 at 12:26
3

Much may depend on the specific compiler, linker, and how you drive them. For example, this code:

// s.c
#include <stdio.h>

void f();

int main() {
    printf( "%p\n", "foo" );
    printf( "%p\n", "foo" );
    f();
}

// s2.c
#include <stdio.h>

void f() {
    printf( "%p\n", "foo" );
    printf( "%p\n", "foo" );
}

when compiled as:

gcc s.c s2.c

produces:

00403024
00403024
0040302C
0040302C

from which you can see the strings have only been coalesced in individual translation units.

1

Identical literals within the same translation unit are processed during the parsing phase. The compiler converts literals into tokens and stores them in a table (for simplicity, assume [token ID, value]). When the compiler encounters a literal for the first time, the value is entered into the table. Later encounters of the same literal reuse that entry. When generating code, this value is placed into memory once and each access reads this single value (except for those cases where placing the value in the executable code more than once speeds up execution or shortens executable length).

Duplicate literals in more than one translation unit may be consolidated by the linker. All identifiers tagged with global access (i.e. visible from outside the translation unit) will be consolidated if possible. That means that the code will access only one version of the symbol.

Some build projects place common or global identifiers into (resource) tables, which allow the identifiers to change without changing the executable. This is a common practice for GUIs that need to present text translated into different languages.

Be aware that some compilers and linkers do not perform the consolidation by default. Some may require a command line switch (or an option). Check your compiler documentation to see how it handles duplicate identifiers or text strings.

Thomas Matthews
1

"/GF (Eliminate Duplicate Strings)"

http://msdn.microsoft.com/en-us/library/s0s0asdt.aspx

Steve-o
1

Assembly language doesn't provide any way to work directly with an anonymous string literal like C or C++ does.

As such, what you almost certainly want to do is define the strings in your assembly code with names. To use those from C or C++, you want to put an extern declaration of the array into a header that you can #include in whatever files need access to them (and in your C++ code, you'll use the names, not the literals themselves):

foo.asm

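.386                        ; the flat model needs a 386 or later CPU mode
.model flat, c
public string1, string2     ; export the labels so the linker can see them

.data
    ; MASM has no "\n" escape; the 10 below is the newline
    string1 db "This is the first string", 10, 0
    string2 db "This is the second string", 10, 0
end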

foo.h:

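// C linkage so C++ code looks for the undecorated assembly symbols
// (for a C-only build, drop the "C")
extern "C" char string1[];
extern "C" char string2[];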

bar.cpp

#include <iostream>
#include "foo.h"

void baz() { std::cout << string1; }
Jerry Coffin