6

For the same program:

const char* s = "abcd";
auto x1 = reinterpret_cast<const int64_t*>(s);
auto x2 = reinterpret_cast<const char*>(x1);
std::cout << *x1 << std::endl;
std::cout << x2 << std::endl; // Always "abcd"

In gcc5(link): 139639660962401
In gcc8(link): 1684234849

  1. Why does the value vary according to different compiler versions?
  2. What, then, is a compiler-safe way to go from const char* to int64_t and back (as in this problem: not for strings of digits only, but for strings containing other characters as well)?
tangy
  • 9
    Accessing `*x1` is **undefined behavior**. `s` is pointing to memory that is guaranteed to have only 5 bytes allocated for the string data. Reading from `*x1` tries to access memory for 8 bytes instead. The last 3 bytes are undefined and may not even be allocated, depending on how the string data is managed. – Remy Lebeau Jan 17 '19 at 19:12
  • 7
    You are violating strict aliasing (UB). You're not allowed to do `*x1` since you don't actually point to an `int64_t`. What are you actually trying to accomplish by doing this? – NathanOliver Jan 17 '19 at 19:12
  • You may get more insight by printing *x1 in hex. – user877329 Jan 17 '19 at 19:14
  • 1
    Be careful with `reinterpret_cast`. The compiler will let you reinterpret a lot of things into a lot of things you shouldn't dereference. There are [a lot of rules](https://en.cppreference.com/w/cpp/language/reinterpret_cast) to using `reinterpret_cast` and the list of things you can safely do in practice is rather short. – François Andrieux Jan 17 '19 at 19:16
  • is there then a correct way to do (2)? Because up to 8 bytes a char sequence should be convertible to int64_t right? – tangy Jan 17 '19 at 19:17
  • 6
    FYI, `"abcd"` is only 5 bytes, `0x61 0x62 0x63 0x64 0x00`. 139639660962401 is hex `0x00007F0064636261` whereas 1684234849 is hex `0x0000000064636261`. You can see that both numeric values include the same 5 bytes of the `"abcd"` string as expected, but the larger numeric value also includes a random byte value `0x7F` after the null terminator, since the `int64_t*` pointer is accessing undefined memory, whereas the smaller numeric value does not have that random byte value in it. – Remy Lebeau Jan 17 '19 at 19:17
  • *is there then a correct way to do (2)? Because upto 8 bytes a char sequence should be convertible to int64_t right?* No. Unless you start with an `int64_t` you do not have one and reading the memory as one is UB. What problem are you trying to solve? – NathanOliver Jan 17 '19 at 19:20
  • 4
    The *politically correct* way to access 8 consecutive bytes as a single `int64_t` is to use `memcpy()` instead of a type-cast – Remy Lebeau Jan 17 '19 at 19:21
  • 1
    I would suggest a simple function like `uint64_t chars_to_int(const std::string& string) { uint64_t return_value = 0; for(const auto& a : string) { return_value <<= 8; return_value |= static_cast<unsigned char>(a); } return return_value; }` – taminob Jan 17 '19 at 19:21
  • @tangy in the [cppreference link](https://en.cppreference.com/w/cpp/language/reinterpret_cast), (2) is not casting a pointer to a `uint64_t*` pointer, like you are trying to do. It is casting a pointer to a `uintptr_t` instead, which is NOT itself a pointer at all, but is just an integer whose byte size is large enough to hold pointer values. That is a very big difference. Don't let the `ptr` in the type name fool you. `uintptr_t` is just an alias for `uint32_t` on 32bit platforms and `uint64_t` on 64bit platforms (or equivalent). – Remy Lebeau Jan 17 '19 at 19:26
  • 1
    @tangy `auto x1 = reinterpret_cast<uintptr_t>(s); std::cout << x1 << std::endl; auto x2 = reinterpret_cast<const char*>(x1);` is perfectly safe. – Remy Lebeau Jan 17 '19 at 19:30
  • Thanks RemyLebeau and Unterfliege for the suggested solutions, and others for the discussion. Could you add it as an answer so it might help others? – tangy Jan 17 '19 at 19:32
  • @NathanOliver i have to store this to a dataset expecting only integral types and it is guaranteed that the const char* will always be <8. – tangy Jan 17 '19 at 19:37
  • @RemyLebeau You should submit your first comment as an answer, as it is the most meaningful and useful one. – okovko Jan 17 '19 at 20:23
  • @tangy to store the *contents* of the string into a 64-bit integer, you need to make sure the string data is at least 8 bytes when using a pointer type cast, otherwise you need to `memcpy()` the string data into a separate `(u)int64_t` variable (which is the best way to go). – Remy Lebeau Jan 17 '19 at 20:37
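
The hex breakdown given in the comments above can be reproduced with a small formatting helper. This is only a minimal sketch: the function name `to_hex` is an assumption made for illustration, and it takes the value as a plain integer, so it never dereferences the miscast pointer.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format a 64-bit value as "0x" plus 16 zero-padded
// lowercase hex digits, so each byte of the value is visible as two digits.
std::string to_hex(std::uint64_t value) {
    std::ostringstream out;
    out << "0x" << std::hex << std::setw(16) << std::setfill('0') << value;
    return out.str();
}
```

For the two values reported in the question, `to_hex(139639660962401)` gives `0x00007f0064636261` and `to_hex(1684234849)` gives `0x0000000064636261`, which makes the stray `7f` byte easy to spot.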

3 Answers

6
  1. Why does the value vary according to different compiler versions?

Behaviour is undefined.

  2. What is then a compiler safe way to move from const char* to int64_t and backward

It is somewhat unclear what you mean by "move from const char* to int64_t". Based on the example, I assume you mean to create a mapping from a character sequence (of no greater length than fits) into a 64 bit integer in a way that can be converted back using another process - possibly compiled by another (version of) compiler.

First, create an int64_t object, initialised to zero:

int64_t i = 0;

Get the length of the string:

auto len = strlen(s);

Check that it fits:

assert(len < sizeof i);

Copy the bytes of the character sequence onto the integer:

memcpy(&i, s, len);

As long as the integer type doesn't have trap representations, the behaviour is well defined, and the generated integer will be the same across compiler versions as long as the CPU endianness (and negative number representation) remains the same.

Reading the character string back doesn't require copying because char is exceptionally allowed to alias all other types:

auto back = reinterpret_cast<char*>(&i);
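
The steps above can be put together as a round trip. This is a minimal sketch, not a definitive implementation: the helper names `pack_chars` and `unpack_chars` are assumptions made for illustration, and the numeric value produced still depends on the CPU's endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: copy a string of at most 7 characters into a
// zero-initialised 64-bit integer. The bytes keep their in-memory order.
std::int64_t pack_chars(const char* s) {
    std::int64_t i = 0;
    const std::size_t len = std::strlen(s);
    assert(len < sizeof i);   // leaves at least one zero byte as terminator
    std::memcpy(&i, s, len);
    return i;
}

// Hypothetical helper: recover the characters from the integer.
std::string unpack_chars(std::int64_t i) {
    char buf[sizeof i + 1] = {};        // extra zero byte guarantees termination
    std::memcpy(buf, &i, sizeof i);     // copy the object representation back out
    return std::string(buf);            // stops at the first zero byte
}
```

Here `unpack_chars` copies into a local buffer instead of aliasing the integer through a `char*`, so that the result is always null-terminated even if the integer did not come from `pack_chars`. With these helpers, `unpack_chars(pack_chars("abcd"))` yields `"abcd"` again.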

Note the qualification above about CPU endianness. This method does not work if the integer is passed (across the network, for example) to a process running on a CPU whose endianness or negative number representation differs. That case can be handled as well, by using bit shifting and masking to copy each octet to a fixed position of significance.
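
The shifting and masking approach could look like the following. This is a sketch under stated assumptions: the helper names `encode_chars` and `decode_chars` are hypothetical, the choice of putting the first character in the least significant octet is arbitrary, and an unsigned 64-bit type is used so that the shifts are well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: place character k at bits [8k, 8k+7] of the result,
// independent of how the CPU lays integers out in memory.
std::uint64_t encode_chars(const std::string& s) {
    assert(s.size() <= 8);
    std::uint64_t value = 0;
    for (std::size_t k = 0; k < s.size(); ++k) {
        const auto octet = static_cast<std::uint64_t>(static_cast<unsigned char>(s[k]));
        value |= octet << (8 * k);
    }
    return value;
}

// Hypothetical helper: read the octets back, stopping at the first zero octet.
std::string decode_chars(std::uint64_t value) {
    std::string s;
    for (int k = 0; k < 8; ++k) {
        const auto octet = static_cast<unsigned char>((value >> (8 * k)) & 0xFF);
        if (octet == 0) break;
        s.push_back(static_cast<char>(octet));
    }
    return s;
}
```

With this layout, `encode_chars("abcd")` is `0x64636261` (1684234849) on every platform, which happens to match what the `memcpy` version produces on a little-endian machine.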

eerorika
  • Note also that you will never encounter a non-two's complement machine nor a machine where a byte is not 8 bits. – okovko Jan 17 '19 at 20:13
  • 2
    @okovko: Those will be caught at compile-time: `int64_t i = 0;` won't compile unless the machine is two's complement, and 64-bits is an integral number of bytes. – Ben Voigt Jan 17 '19 at 20:16
  • 2
    @okovko [except when you do](https://stackoverflow.com/a/2098444/2079303). – eerorika Jan 17 '19 at 20:18
  • @BenVoigt That's good to know, but I think eerorika was discussing passing a value between machines on a network, which precludes compilation. That said, there is no reason to be concerned about the complement nor the byte size, because those are simply not variables in 2019. – okovko Jan 17 '19 at 20:19
  • @eerorika let me know the day you pass a value to a Texas Instruments DSP over a network. – okovko Jan 17 '19 at 20:19
  • 2
    @okovko network is just one example of transferring data. Surely there are interfaces to connect a PC to a DSP? I agree that those encounters you mention are rare, but not impossible. – eerorika Jan 17 '19 at 20:27
  • Sure, but that's outside the scope of discussing portability. – okovko Jan 17 '19 at 20:29
2

When you dereference the int64_t pointer, it is reading past the end of the memory allocated for the string you cast from. If you changed the length of the string to at least 8 bytes, the integer value would become stable.

const char* s = "abcdefg"; // plus null terminator
auto x1 = reinterpret_cast<const int64_t*>(s);
auto x2 = reinterpret_cast<const char*>(x1);
std::cout << *x1 << std::endl;
std::cout << x2 << std::endl; // Always "abcdefg"

If you wanted to store the pointer in an integer instead, you should use intptr_t and leave out the * like:

const char* s = "abcd";
auto x1 = reinterpret_cast<intptr_t>(s);
auto x2 = reinterpret_cast<const char*>(x1);
std::cout << x1 << std::endl;
std::cout << x2 << std::endl; // Always "abcd"
nate
  • 4
    Behaviour is still undefined. `int64_t` is not allowed to alias a `char` (array). Besides, the array is not guaranteed to meet the alignment requirement of `int64_t`. – eerorika Jan 17 '19 at 19:50
  • @eerorika I understand the alignment restriction, but not the aliasing one. – nate Jan 17 '19 at 19:55
  • @eerorika The alignment is well defined if the char array is declared at the top of the enclosing scope. If you had something like `char a; const char* s = "abcd";` then you would get a CPU exception on various ARM architectures. – okovko Jan 17 '19 at 20:51
  • 3
    @okovko C++ makes no guarantees about whether the call frame is aligned to some boundary, nor does it guarantee order of local variables in memory. Your ABI may give you guarantees, which you could take for granted if you don't mind restricting portability. Regardless, it is quite trivial to specify the alignment if you need it. – eerorika Jan 17 '19 at 21:03
  • @eerorika There's an underrated concept of the stable and normalized behaviors of machines and compilers that programmers can rely on, despite them not being specified. It's the true "standard" because it is actually implemented. – okovko Jan 17 '19 at 21:09
  • 1
    @okovko good luck with the bug hunting when your updated compiler in the future decides to do more aggressive optimization and undefined behavior goes its way ;-P eerorika is completely right – phön Jan 18 '19 at 08:17
  • @phön Okay, ghost hunter. – okovko Jan 18 '19 at 15:57
0

Based on what RemyLebeau pointed out in the comments of your post,

uint64_t five_byte_mask = 0xFFFFFFFFFF; std::cout << (*x1 & five_byte_mask) << std::endl;

Should be a reasonable way to get the same value on a little endian machine with whatever compiler. It may be UB by one specification or another, but from a compiler's perspective, you're dereferencing eight bytes at a valid address that you have initialized five bytes of, and masking off the remaining bytes that are uninitialized / junk data.

okovko