2

I would like to find a maximally efficient way to compute a char that contains the least significant bits of an int in C++11. The solution must work with any possible standards-compliant compiler. (I'm using the N3290 C++ draft spec, which is essentially C++11.)

The reason for this is that I'm writing something like a fuzz tester, and want to check libraries that require a std::string as input. So I need to generate random characters for the strings. The pseudo-random generator I'm using provides ints whose low bits are pretty uniformly random, but I'm not sure of the exact range. (Basically the exact range depends on a "size of test case" runtime parameter.)

If I didn't care about working on any compiler, this would be as simple as:

inline char int2char(int i) { return i; }

Before you dismiss this as a trivial question, consider that:

  • You don't know whether char is a signed or unsigned type.

  • If char is signed, then a conversion from an unrepresentable int to a char is "implementation-defined" (§4.7/3). This is far better than undefined, but for this solution I'd need to see some evidence that the standard prohibits things like converting all ints not between CHAR_MIN and CHAR_MAX to '\0'.

  • reinterpret_cast is not permitted between a signed and unsigned char (§5.2.10). static_cast performs the same conversion as in the previous point.

  • char c = i & 0xff;--though it silences some compiler warnings--is almost certainly not correct for all implementation-defined conversions. In particular, i & 0xff is always a positive number, so in the case that c is signed could quite plausibly not convert negative values of i to negative values of c.

Here are some solutions that do work, but in most of these cases I'm worried they won't be as efficient as a simple conversion. These also seem too complicated for something so simple:

  • Using reinterpret_cast on a pointer or reference, since you can convert from unsigned char * or unsigned char & to char * or char & (but at the possible cost of runtime overhead).

  • Using a union of char and unsigned char, where you first assign the int to the unsigned char, then extract the char (which again could be slower).

  • Shifting left and right to sign-extend the int. E.g., if i is the int, running c = ((i << 8 * (sizeof(i) - sizeof(c)) >> 8 * (sizeof(i) - sizeof(c)) (but that's inelegant, and if the compiler doesn't optimize away the shifts, quite slow).

Here's a minimal working example. The goal is to argue that the assertions can never fail on any compiler, or to define an alternate int2char in which the assertions can never fail.

#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>

using namespace std;

constexpr char int2char(int i) { return i; }

int
main(int argc, char **argv)
{
  for (int n = 1; n < min(argc, 127); n++) {
    char c = -n;
    int i = (atoi(argv[n]) << 8) ^ -n;
    assert(c == int2char(i));
  }
  return 0;
}

I've phrased this question in terms of C++ because the standards are easier to find on the web, but I am equally interested in a solution in C. Here's the MWE in C:

#include <assert.h>
#include <stdlib.h>

static char int2char(int i) { return i; }

int
main(int argc, char **argv)
{
  for (int n = 1; n < argc && n < 127; n++) {
    char c = -n;
    int i = (atoi(argv[n]) << 8) ^ -n;
    assert(c == int2char(i));
  }
  return 0;
}
user3188445
  • 4,062
  • 16
  • 26
  • 3
    Please do **not** use C tag for C++. These are different languages! – too honest for this site Jul 17 '15 at 23:19
  • 1
    Do you care only about the *bits* of the given integer, or about the resulting *value*? If you only want the bits and don't care about the resulting value then masking with `0xff` (or, rather, `(1 << CHAR_BIT) - 1`) is the best solution. It's when you start to worry about the resulting *value* that things get problematic. Can you please give us a use-case, or the actual problem you try to solve? – Some programmer dude Jul 17 '15 at 23:25
  • I would be glad for a C11 solution as well. I just happened to have a copy of the C++11 lying around to refer to, but if you can show me how to do the same thing in C, I'd be just as happy. – user3188445 Jul 17 '15 at 23:26
  • 1
    I care mostly about the bits of the value. For example, if I have a random integer that is uniformly distributed between, say, -1024 and +1023, I would like to end up with a char that is also uniformly distributed over its entire range of values. – user3188445 Jul 17 '15 at 23:28
  • Also, how strict is the requirement to use `char`? Since you're not wanting an actual printable *character*, why not use a better type like e.g. `int8_t`? Then you sidestep a lot of problems because you then *know* that it's signed. – Some programmer dude Jul 17 '15 at 23:39
  • I need to use char, because I want to store these in a string. The better type would be uint8_t, because unsigned types are so much easier to reason about. – user3188445 Jul 17 '15 at 23:41
  • And have you thought about simply dividing by `1024.0` and then multiplying the result with `128` and cast to the appropriate integer type? Then you would still have the uniform distribution from your original range. – Some programmer dude Jul 17 '15 at 23:41
  • When you say "string" I assume just a sequence of bytes, not an actual *text* string that you will print with e.g. `printf`? Because many of the values of a `char` is simply not printable. Then why not go with an array (fixed-size or dynamically allocated) of the type you want? – Some programmer dude Jul 17 '15 at 23:43
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/83586/discussion-between-user3188445-and-joachim-pileborg). – user3188445 Jul 17 '15 at 23:47
  • 1
    I note the assumption that char is always 8 bits. This is not the case, char can be any size. Or are u narrowing the scope to machines with 8-bit chars – pm100 Jul 18 '15 at 00:03
  • I suppose it would be nice to work with any compiler at all (in which case my MWE has to be revised), but I'm okay assuming chars are 8 bits. – user3188445 Jul 18 '15 at 00:06
  • "where you first assign the int to the unsigned char, then extract the char (which again could be slower)" - why do you believe that could be slower? Given that you're interested in *bits* (rather than *numeric value*), this sounds like semantically the correct thing to do (although unioning is kind of icky, I would probably `memcpy`). – Oliver Charlesworth Jul 18 '15 at 00:10
  • 1
    Taking that one step further; does `char ret; memcpy(&ret, (char *)&i + OFFSET, 1); return ret;` satisfy your criteria? (where `OFFSET` would be a macro dependent on an endianness check or something.) – Oliver Charlesworth Jul 18 '15 at 00:17
  • 1
    Do you REALLY need code that is 100% sure to run on 36-bit mainframe computers, and weird 24-bit DSPs? Or is it OK if it works correctly on x86, ARM, MIPS, 68K, 29K and Alpha? If the latter, I'd say the behaviour of `int x = rand(); char c = x;` will be sufficiently well defined [in the sense that the compiler will just truncate the integer into a value that fits in a `char` without any further conversion] – Mats Petersson Jul 18 '15 at 00:20
  • I guess you could use SFINAE to only use complicated code when char is actually signed. Also you seem to only be guessing about negative performance impacts of your solutions. Do you have any evidence for your claims? – Daniel Jour Jul 18 '15 at 00:58
  • Your `union` suggestion causes undefined behaviour on C++ (only the most recently written member of a union may be read) – M.M Jul 18 '15 at 03:29
  • Your other suggestions seem to assume 2's complement . It's unclear why you are comfortable assuming 2's complement but not assuming simple truncation for int->char conversion. – M.M Jul 18 '15 at 03:36
  • @MattMcNabb: Can you point me to where in the standard the union suggestion would be prohibited? §3.9.1.1 says char and unsigned char are both the same size and use all of their bits. I believe the strict aliasing rules also allow use of pointers to both union members, and to cast the pointer to the union to the pointer to the active member (though I can't find where that is discussed in the standard). I can't find any language that would prohibit what I suggested. – user3188445 Jul 18 '15 at 05:04
  • @user3188445 [see here](http://stackoverflow.com/questions/11373203/accessing-inactive-union-member-undefined) – M.M Jul 18 '15 at 05:41

2 Answers2

0

a far better way is to have an array of chars and generate a random number to pick a char from that array. This way you get 'well behaved' characters; or at least characters with well defined badness. If you really want all 256 chars (note 8 bit assumption) then create an array with 256 entries in it ('a','b',....'\t','n'.....)

This will be portable too

pm100
  • 48,078
  • 23
  • 82
  • 145
  • But that's likely to be suboptimal even with optimization. I ought to be able to do this conversion without touching memory. – user3188445 Jul 18 '15 at 05:21
0

Given that you appear to be interested in bit value (rather than numeric value), and have also asked for C solutions, I'm going to post what I believe to be something that's compliant and optimal:

inline char int2char(int i) {
    char ret;
    memcpy(&ret, (char *)&i + OFFSET, 1);
    return ret;
}

where OFFSET is a macro that expands to either 0 or sizeof(int)-1, based on an endianness check.

AFAICS, this works invariant of whether char is signed or unsigned, of what representation is used for negative values, or of the width of char or int. It doesn't rely on any weird type-punning tricks, and has no branching or complex operations (such as divide).

I say "optimal" because I'm assuming that any sane compiler treats memcpy as an intrinsic, and thus will do something smart here.

Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
  • So indeed on x86-64, both clang and gcc with -O seem to generate the same code for memcmp or a simple conversion. Without optimization, g++ seems to generate substantially worse code for the memcmp case. (It doesn't actually use any string instructions, but somehow it takes 4 instructions to do what the conversion function does in 2 instructions.) – user3188445 Jul 18 '15 at 04:10
  • But with optimization, I get the same code out of `const unsigned char c = i; return reinterpret_cast(c);`, which works regardless of endianness. Though the code is even worse with no optimization. – user3188445 Jul 18 '15 at 04:57
  • @user3188445: Sure, and that's a good thing. It shows that the code I'm proposing (which AFAICS is 100% portable) is as fast as the less portable code that you're unhappy with (at least on your platform). – Oliver Charlesworth Jul 18 '15 at 05:01
  • @user3188445: Actually, on second thoughts, I'm not convinced that code *is* unportable. So why were you unhappy with it as an option? – Oliver Charlesworth Jul 18 '15 at 05:02
  • Well, in an ideal world the generated code would be decent even without optimization. I do often compile without optimization to save time during development, and this is pretty inner-loop kind of stuff, transforming a big array of integers into a potentially long string. It just seems kind of insane that I have to do this whole big thing and rely on compiler optimization to do something as simple as what actually happens with "(char) i". – user3188445 Jul 18 '15 at 05:18
  • @user3188445: Expecting code at `-O0` to be performant just isn't reasonable. – Oliver Charlesworth Jul 18 '15 at 05:19