
I am aware that the C standard allows string literals with identical content to be stored in different locations; at least that's what I have been told, and what I take away from other posts here on SO, e.g. this or this one. However, it strikes me as odd that the standard does not require identical literals to share a location, since that would guarantee smaller executables and considerably speed up equality checks on string literals, making them an O(1) operation instead of O(n).

I would like to know what arguments, from an implementer's POV, make it appealing to allow the locations of these literals to differ. Do compilers do any kind of optimization that makes the saving from comparing the literals' locations irrelevant? I am well aware that such a location comparison would be useless when comparing a literal with a variable that points to a different location containing the same string, but I am trying to understand how the people who write the standard look at this.

I can think of arguments why you would not want to do that, e.g. the subtle errors you might introduce when you make a location-based comparison an operation supported by the standard, but I am not entirely satisfied with what I could come up with.
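To make the trade-off concrete, here is a minimal sketch of the two kinds of comparison. The helper names `same_location`, `same_content` and `str_equal` are made up for illustration. Whether two identical literals share an address is unspecified, so the pointer test can only serve as a fast path, never as the whole check.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: O(1) identity check. True only if both pointers
   refer to the same location. For two identical literals the result is
   unspecified, because the implementation may or may not pool them. */
static bool same_location(const char *a, const char *b)
{
    return a == b;
}

/* Hypothetical helper: O(n) content check. Works regardless of pooling. */
static bool same_content(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Same location implies same content, so the pointer test is a safe
   shortcut; the strcmp fallback keeps the result portable. */
static bool str_equal(const char *a, const char *b)
{
    return same_location(a, b) || same_content(a, b);
}
```

Calling `str_equal("hello", "hello")` yields true on any conforming implementation, whereas `same_location("hello", "hello")` may yield either result.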

I hope some of you can shed light on this.

Edit 1: First of all thank you for your answers and comments. Beyond that I would like to add some thoughts on some of the answers given:

@hvd: I think this is a problem for the specific additional optimization, not the idea of having a single instance per string literal.

@Shafik: I think your question makes it clear to me why having this set in stone would not allow for a lot of useful usages. It could only be used in code that is limited to the translation unit's scope. Once two files with the same string literal are compiled independently of each other, both would contain their own string literal at their own location. Objects would have to use an external reference or be recompiled every time they are combined with other objects containing the same literal.

I think I am sufficiently convinced that the less strict implementation spec as John Bollinger and FUZxxl suggested is preferable, given how little could be gained by JUST specifying that string literals should exist only once per translation unit.

midor
  • My [answer to this question](http://stackoverflow.com/q/26279628/1708801) may be helpful. – Shafik Yaghmour Nov 24 '14 at 17:16
  • Because people like to write `char *foo="blah";`, and a lot of code exists that has such definitions in header files which are included into multiple translation units. Disallowing that would break a *lot* of existing code. – William Pursell Nov 24 '14 at 17:18
  • The standard does not require or prohibit optimizations of any kind. Why is this one suddenly important? – n. m. could be an AI Nov 24 '14 at 17:33
  • It is "necessary" today because it was not required "in the beginning", as that made for a bigger (slower) compiler. To require it now would break the few compilers, if any, that do not do this. A transition period would be needed (years) and the C community is not overwhelmingly for it. – chux - Reinstate Monica Nov 24 '14 at 17:35
  • If you want to pool the string literals there is nothing to stop you defining them once, say in an included file purely for that purpose. – Weather Vane Nov 24 '14 at 17:40
  • It strikes me odd that you would suppose the standard should *require* the locations of identical string literals to be the same. Any space saving is uncertain, and string comparisons are not helped because programs rarely compare two string *literals*. Given so little gain to be had, why would the standards committee saddle implementations with such a requirement? – John Bollinger Nov 24 '14 at 17:53
  • @WilliamPursell: Requiring identical string literals to be stored only once would not interfere with writing `char *foo="blah";`. It would just require some work at link time. Implementations are already *permitted* to store all instances of `"blah"` in one place. – Keith Thompson Nov 24 '14 at 19:13

2 Answers


Aside from older compilers that simply want to avoid doing unnecessary work, the requirement would not necessarily be useful even today.

Suppose you have one translation unit with the string literals `"a"` and `"ba"`. Suppose you also have an optimising compiler that notices this translation unit's `"a"` can be optimised to `"ba"+1`.

Now suppose you have another translation unit with the string literals `"a"` and `"ca"`. The same compiler would then optimise that translation unit's `"a"` to `"ca"+1`.

If the first translation unit's `"a"` must compare equal to the second translation unit's `"a"`, compilers cannot merge strings like this, even though this is a useful optimisation to save space. (As FUZxxl points out in the comments, some linkers do this, and if one of those linkers is used, the compiler doesn't need to. Not all linkers do this, though, so it may still be a worthwhile optimisation in the compiler.)
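The suffix merging described above can be imitated by hand. This is only a sketch of what a merging compiler might emit for the first translation unit, not the output of any particular compiler; the names `merged`, `literal_ba` and `literal_a` are hypothetical.

```c
/* One array stands in for the storage of both literals: 'b', 'a', '\0'. */
static const char merged[] = "ba";

/* Stand-in for the literal "ba": the start of the shared storage. */
const char *literal_ba(void)
{
    return merged;
}

/* Stand-in for the literal "a": the tail of the same storage, i.e. "ba"+1. */
const char *literal_a(void)
{
    return merged + 1;
}
```

A second translation unit that merged its `"a"` into `"ca"` the same way would hand out a different address for `"a"`, which is exactly the conflict with a cross-unit equal-address requirement.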

  • In other words, string merging must be done at link-time, which is precisely what an ELF linker does. Of course you still run into problems with shared libraries (by design, a string literal in a shared object cannot have the same address as an equal string literal elsewhere). – fuz Nov 24 '14 at 18:13
  • @FUZxxl Agreed, that's a valid approach too. But not all linkers perform such optimisations, and if the linker does not, that's still a good reason why the compiler might. –  Nov 24 '14 at 18:23

The C standard has traditionally been written in a way that makes writing a basic C compiler a comparatively simple task. This is important because a C compiler is usually among the first things that need to be provided on a new platform, due to the ubiquity of the C language.

For this reason, the C standard does

  • provide syntax like the `register` keyword to aid dumb compilers,
  • not mandate any optimizations,
  • not specify many aspects of its behaviour.
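As a small illustration of the first point, the sketch below uses `register` on a loop counter. The function name `sum_ints` is made up for this example. The keyword is only a hint that a simple compiler may use for register allocation; an optimising compiler is free to ignore it.

```c
/* Sums the first n elements of values.
   `register` suggests keeping i in a CPU register; the only hard rule it
   adds is that the address of i may not be taken. */
long sum_ints(const int *values, int n)
{
    register int i;
    long total = 0;

    for (i = 0; i < n; i++)
        total += values[i];

    return total;
}
```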
fuz