26

Is the code below safe? It might be tempting to write code akin to this:

```cpp
#include <map>

const std::map<const char*, int> m = {
    {"text1", 1},
    {"text2", 2}
};

int main () {
    volatile const auto a = m.at("text1");
    return 0;
}
```

The map is intended to be used with string literals only.

I think it's perfectly legal and it seems to work, however I never saw a guarantee that the pointer for the literal used in two different places to be the same. I couldn't manage to make the compiler generate two separate pointers for literals with the same content, so I started to wonder how firm the assumption is.

I am only interested in whether literals with the same content can have different pointers. Or, more formally, can the code above throw?

I know that there's a way to write the code so that it is sure to work, and I think the above approach is dangerous because the compiler could decide to use two different storage locations for the same literal, especially if the two uses are in different translation units. Am I right?

luk32
  • 1
    *"however I never saw a guarantee that the pointer for the literal used in two different places to be the same"* - There's a very good reason for that – StoryTeller - Unslander Monica Sep 20 '18 at 11:14
  • 1
    Why not use `std::string`? – tkausl Sep 20 '18 at 11:15
  • 1
    @tkausl Because that's beyond the scope of question as explicitly mentioned. I know how to write it properly. Also, because `std::string` ctor is not `constexpr`. – luk32 Sep 20 '18 at 11:27
  • @StoryTeller That is teasing. Please share! – luk32 Sep 20 '18 at 11:28
  • see [String Literal address across translation units](https://stackoverflow.com/q/26279628/1708801) – Shafik Yaghmour Sep 20 '18 at 13:11
  • 2
    @luk32 As a side note, your code is completely legal, it just might not do what you expect (i.e. `std::terminate` might be called due to an uncaught exception). – Arne Vogel Sep 20 '18 at 14:43
  • If all you work with are string literals, consider using an enum. – IS4 Sep 20 '18 at 16:55
  • Learn about the C++ term "undefined behavior" (and related terms). You probably aren't only interested in whether the behaviour here is defined--you probably want to know whether it is defined and "works". But you don't clearly say what "works" means. [mcve] – philipxy Sep 21 '18 at 01:13
  • @philipxy Dude... The code is there, the formal question is whether it can throw, it's also there. I am pretty sure the behaviour is defined (or at least implementation specific), that is what I meant by "*perfectly legal*". If I suspected UB I'd have said so, because asking about any behaviour after invoking UB makes no sense. **How precisely do you suggest I improve it further?** Currently you seem pretty condescending, assuming I have no idea about UB and asking for an MCVE in front of code and a clear question about it... This must be some misunderstanding... – luk32 Sep 21 '18 at 08:10
  • Possible duplicate of [How can different strings have the same address](https://stackoverflow.com/q/49135761/608639), [Addresses of two char pointers to different string literals are same](https://stackoverflow.com/q/19088153/608639), [Same strings in array have same memory address](https://stackoverflow.com/q/26433563/608639), etc. – jww Oct 17 '18 at 07:19

4 Answers

21

Whether or not two string literals with the exact same content are the exact same object is unspecified, and in my opinion best not relied upon. To quote the standard:

[lex.string]

16 Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above. Whether all string literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

If you wish to avoid the overhead of `std::string`, you can write a simple view type (or use `std::string_view` in C++17) that refers to a string literal without owning it. Use it to compare by content instead of relying upon literal identity.

StoryTeller - Unslander Monica
  • In production I did use a custom wrapper type around `const char*` and overloaded relevant operators. I wondered if I wrote superfluous code. In my particular case the string overhead is not much of a problem, it's about `constexpr` ctor, also I am still restricted to c++14. – luk32 Sep 20 '18 at 12:35
  • @luk32 - You can write a wrapper geared at string literals explicitly (taking a `char*` necessitates a length check). Consider [this for example](http://coliru.stacked-crooked.com/a/efb7c506a2b1228a) instead. It essentially gets rid of all the overhead, except for the comparison itself. – StoryTeller - Unslander Monica Sep 20 '18 at 12:49
  • @AndreasRejbrand: Absolutely! – Bathsheba Sep 20 '18 at 14:01
  • @StoryTeller I did something similar, after polishing it's nearly identical, though without constexpr size (I wasn't smart enough with the c'tor). Interestingly enough, it somehow shaves off two asm instructions from the compile-time bimap lookup: https://godbolt.org/z/ZCMUKv. I think it's related to passing the size for some reason. – luk32 Sep 20 '18 at 14:19
  • @luk32 - kewl! :) I suspect your reduced object size plays a part in that. – StoryTeller - Unslander Monica Sep 20 '18 at 14:20
  • @luk32 - Though you know. If you take a traits class, you can build `compare` over `Traits::compare`. That was my thinking originally before I got too lazy. – StoryTeller - Unslander Monica Sep 20 '18 at 14:22
  • @luk32 - Yup. Deleting the `_len` member produces the exact same ASM https://godbolt.org/z/ECVAp3 – StoryTeller - Unslander Monica Sep 20 '18 at 14:25
  • @StoryTeller Yea. I confirmed it's the size of the string as well. I'm not sure why it's needed there. I might work on generalizing for other types, personally I'm interested in "smart" enums, so keeping it a compile-time constant is a priority for me, and it's good enough at this point. Thanks, nice help, I think we got a useful piece of code. – luk32 Sep 20 '18 at 14:27
  • @luk32 - I retained it for `Traits::compare`. It's expected to take a size argument. But I suppose one can do without. – StoryTeller - Unslander Monica Sep 20 '18 at 14:28
  • Yea, it's a design decision whether to use `strncmp` or `strcmp`. For literals I guess it's okay, but I don't think it's possible to restrict usage, and those two asm instructions might prevent some buffer overflows. This implementation of lookup has its own problems if it goes to runtime. – luk32 Sep 20 '18 at 14:38
21

The Standard does not guarantee the addresses of string literals with the same content will be the same. In fact, [lex.string]/16 says:

Whether all string literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

The second part even says you might not get the same address when a function containing a string literal is called a second time! Though I've never seen a compiler do that.

So using the same character array object when a string literal is repeated is an optional compiler optimization. With my installation of g++ and default compiler flags, I also find I get the same address for two identical string literals in the same translation unit. But as you guessed, I get different ones if the same string literal content appears in different translation units.


A related interesting point: it's also permitted for different string literals to use overlapping arrays. That is, given

```cpp
const char* abcdef = "abcdef";
const char* def = "def";
const char* def0gh = "def\0gh";
```

it's possible you might find abcdef+3, def, and def0gh are all the same pointer.

Also, this rule about reusing or overlapping string literal objects applies only to the unnamed array object directly associated with the literal, used if the literal immediately decays to a pointer or is bound to a reference to array. A literal can also be used to initialize a named array, as in

```cpp
const char a1[] = "XYZ";
const char a2[] = "XYZ";
const char a3[] = "Z";
```

Here the array objects a1, a2 and a3 are initialized using the literal, but are considered distinct from the actual literal storage (if such storage even exists) and follow the ordinary object rules, so the storage for those arrays will not overlap.

aschepler
  • 2
    I was just thinking how to cook up some pathological literal. A simple `\0` will do, +1. – StoryTeller - Unslander Monica Sep 20 '18 at 11:40
  • 2
    If your example used `const char *abcdef = "abcdef";` etc. that optimization would be legitimate, but code would be allowed to compare the addresses of `def` and `def0gh`, and the Standard specifies that they would be observably different. – supercat Sep 20 '18 at 21:16
  • @supercat Oops, that's absolutely correct. Fixed, thanks. – aschepler Sep 20 '18 at 22:11
  • @aschepler I suggest incorporating the comment and the code from before the fix into this answer, IMHO it's an interesting technicality. I accept this one as a consolation for the lower score (despite being practically head to head in time), and more importantly for giving nice nifty counterexamples. – luk32 Sep 21 '18 at 08:15
  • Sure, added in another example with named arrays again, and explained why the behavior is different. – aschepler Sep 21 '18 at 10:50
  • "are all the same pointer" is probably not what you meant. They could be pointers into a contiguous – Artelius Jul 24 '20 at 02:24
  • @Artelius Well, yes, "all evaluate to the same pointer value, after an array-to-pointer conversion where applicable". But I don't see the less precise wording causing relevant confusion here? – aschepler Sep 19 '20 at 14:00
5

No, the C++ standard makes no such guarantees.

That said, if the code is in the same translation unit then it would be difficult to find a counterexample. If main() is in a different translation unit then a counterexample might be easier to produce.

If the map is in a different dynamically linked library or shared object then the addresses will almost certainly differ.

The volatile qualifier is a red herring.

Bathsheba
  • I used `volatile` for copy-paste friendliness with online compilers, to prevent complete removal of code. I think it does work this way and doesn't get in the way. – luk32 Sep 20 '18 at 12:31
3

The C++ standard does not require an implementation to de-duplicate string literals.

When identical string literals reside in different translation units or different shared libraries, de-duplicating them would have to be done by the linker (ld) or the runtime linker (ld.so), and they don't do that.

Maxim Egorushkin