20

For example, I have a cstring "E8 48 D8 FF FF 8B 0D" (including spaces) which needs to be converted into the equivalent unsigned char array {0xE8,0x48,0xD8,0xFF,0xFF,0x8B,0x0D}. What's an efficient way to do this? Thanks!

EDIT: I can't use the std library... so consider this a C question. I'm sorry!

dda
Gbps

7 Answers

37

This answers the original question, which asked for a C++ solution.

You can use an istringstream with the hex manipulator:

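#include <sstream>
#include <string>
#include <vector>

std::string hex_chars("E8 48 D8 FF FF 8B 0D");

std::istringstream hex_chars_stream(hex_chars);
std::vector<unsigned char> bytes;

unsigned int c;  // must be an integer type, not a char type (see the note below)
while (hex_chars_stream >> std::hex >> c)
{
    bytes.push_back(static_cast<unsigned char>(c));
}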

Note that c must be an int (or long, or some other integer type), not a char; if it is a char (or unsigned char), the wrong >> overload will be called and individual characters will be extracted from the string, not hexadecimal integer strings.

Additional error checking to ensure that the extracted value fits within a char would be a good idea.

James McNellis
  • 3
    Because I cannot give two correct answers, I went ahead and upvoted this one, as this definitely is a great solution for C++ users! – Gbps Jul 11 '10 at 00:50
14

You'll never convince me that this operation is a performance bottleneck. The efficient way is to make good use of your time by using the standard C library:

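#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Skip leading whitespace, then parse one hex value; *endptr ends up just past it. */
static unsigned char gethex(const char *s, char **endptr) {
  assert(s);
  while (isspace((unsigned char)*s)) s++;  /* cast avoids UB for negative char values */
  assert(*s);  /* fires on trailing whitespace: no value left to parse */
  return strtoul(s, endptr, 16);
}

/* The caller frees the result. The size assumes two digits per byte plus one separator. */
unsigned char *convert(const char *s, int *length) {
  unsigned char *answer = malloc((strlen(s) + 1) / 3);
  unsigned char *p;
  for (p = answer; *s; p++)
    *p = gethex(s, (char **)&s);
  *length = p - answer;
  return answer;
}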

Compiled and tested. Works on your example.
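
For input that might not be well formed, the same strtoul approach can be extended with a few checks. The following is only a sketch: parse_hex_bytes, its caller-supplied buffer, and its -1 error convention are assumptions made for illustration and are not part of the answer above.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: parses whitespace-separated hex values from s into
 * out, which has room for cap bytes. Returns the number of bytes stored,
 * or -1 on a malformed token, a value above 0xFF, or too many values. */
int parse_hex_bytes(const char *s, unsigned char *out, int cap)
{
    int n = 0;

    for (;;) {
        char *end;
        unsigned long v;

        while (isspace((unsigned char)*s))
            s++;                              /* skip separators */
        if (*s == '\0')
            return n;                         /* clean end of input */
        if (!isxdigit((unsigned char)*s))
            return -1;                        /* rejects signs and other junk */

        v = strtoul(s, &end, 16);             /* consumes at least one digit here */
        if (v > 0xFF)
            return -1;                        /* does not fit in one byte */
        if (n == cap)
            return -1;                        /* output buffer is full */
        out[n++] = (unsigned char)v;
        s = end;                              /* continue after the parsed value */
    }
}
```

Unlike the version above, this tolerates trailing whitespace and single-digit values such as "A", and it reports errors through the return value instead of asserting.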

Norman Ramsey
8
  • Iterate through all the characters.
    • If you have a hex digit, the number is (ch >= 'A')? (ch - 'A' + 10): (ch - '0').
      • Left shift your accumulator by four bits and add (or OR) in the new digit.
    • If you have a space, and the previous character was not a space, then append your current accumulator value to the array and reset the accumulator back to zero.
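
The steps above can be sketched in C roughly as follows. The function name hex_accumulate and the caller-supplied buffer are assumptions for illustration. The sketch also flushes the accumulator at the end of the string, which the list leaves implicit, and like the list it assumes the input contains only uppercase hex digits and spaces.

```c
#include <stddef.h>

/* Hypothetical helper: writes at most cap bytes to out and returns the
 * number written. Performs no validation of the input characters. */
size_t hex_accumulate(const char *s, unsigned char *out, size_t cap)
{
    size_t n = 0;
    unsigned acc = 0;      /* value being built up, four bits per digit */
    int in_number = 0;     /* nonzero once the current value has a digit */

    for (;; s++) {
        char ch = *s;

        if (ch == ' ' || ch == '\0') {
            /* A separator that follows a digit ends the current value. */
            if (in_number && n < cap)
                out[n++] = (unsigned char)acc;
            acc = 0;
            in_number = 0;
            if (ch == '\0')
                return n;
        } else {
            unsigned digit = (ch >= 'A') ? (unsigned)(ch - 'A' + 10)
                                         : (unsigned)(ch - '0');
            acc = (acc << 4) | digit;   /* shift left by four bits, OR in the digit */
            in_number = 1;
        }
    }
}
```

Values beyond cap are silently dropped in this sketch; a real caller would probably want to treat that as an error.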
Mark
Ben Voigt
  • +1: This is probably the most straightforward and simple way to do it. – James McNellis Jul 10 '10 at 23:22
  • That's basically what I did, except for using switch instead of ternary test. Depending on compiler and processor architecture one or the other may be faster. But you should also test every character is in range 0-9A-F, and it makes testing the same thing two times. – kriss Jul 10 '10 at 23:42
  • 1
    @kriss: It's all in the assumptions. You assume that there must be exactly two hex digits and one space between each value, mine allows omission of a leading zero or multiple spaces, but assumes that there are no other classes of characters in the string. If you can't assume that, I'd probably choose to do validation separately, by testing `if (s[strspn(s, " 0123456789ABCDEF")]) /* error */;` Sure, it's another pass on the string, but so much cleaner. Or avoid the second pass over the string by using `isspace` and `isxdigit` on each character, which uses a lookup table for speed. – Ben Voigt Jul 11 '10 at 00:19
  • Looping around switches is not really an issue; I do not really count it as a difference. I chose to assume there were exactly two hex chars per value in the input, because if you allow more than that you should also range-check the values. And what about allowing negative numbers? We would have to manage the sign, etc. switch *is* a kind of lookup table... (and another fast conversion method would be to really use one implemented as an array). – kriss Jul 11 '10 at 00:40
  • The problem specified that all inputs were unsigned. The problem didn't specify that there would always be zeros padding to exactly two digits (e.g. all of these fit in a `char`: `0xA`, `0x0A`, `0x000A`) or just one space, although these assumptions were true on the sample input. – Ben Voigt Jul 11 '10 at 01:23
  • You should use isxdigit first. Or see R's comment above. – Mark Dec 09 '11 at 12:42
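
The two validation options mentioned in the comments could look roughly like this. Both function names are hypothetical. Note that the second check is looser than the first: isxdigit also accepts a-f, and isspace also accepts tabs and newlines.

```c
#include <ctype.h>
#include <string.h>

/* Separate validation pass using strspn: returns 1 when s contains nothing
 * but spaces and uppercase hex digits, 0 otherwise. */
int is_clean_strspn(const char *s)
{
    /* strspn gives the length of the leading run of allowed characters;
     * the input is clean when that run reaches the terminator. */
    return s[strspn(s, " 0123456789ABCDEF")] == '\0';
}

/* Per-character check using the <ctype.h> classification functions. */
int is_clean_ctype(const char *s)
{
    for (; *s; s++) {
        unsigned char ch = (unsigned char)*s;  /* avoid negative char values */
        if (!isspace(ch) && !isxdigit(ch))
            return 0;
    }
    return 1;
}
```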
5

use the "old" sscanf() function:

string s_hex = "E8 48 D8 FF FF 8B 0D"; // source string
char *a_Char = new char[ s_hex.length()/3 + 1 ]; // output char array (array new, not a single char)

for( unsigned i = 0, uchr ; i < s_hex.length() ; i += 3 ) {
    sscanf( s_hex.c_str()+ i, "%2x", &uchr ); // conversion
    a_Char[i/3] = uchr; // save as char
}
delete[] a_Char; // must match new[]
amigo
5

If you know the length of the string to be parsed beforehand (e.g. you are reading something from /proc) you can use sscanf with the 'hh' type modifier, which specifies that the next conversion is one of diouxX and the pointer to store it will be either signed char or unsigned char.

// example: ipv6 address as seen in /proc/net/if_inet6:
char myString[] = "fe80000000000000020c29fffe01bafb";
unsigned char addressBytes[16];
sscanf(myString,
       "%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx"
       "%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx",
       &addressBytes[0], &addressBytes[1], &addressBytes[2], &addressBytes[3],
       &addressBytes[4], &addressBytes[5], &addressBytes[6], &addressBytes[7],
       &addressBytes[8], &addressBytes[9], &addressBytes[10], &addressBytes[11],
       &addressBytes[12], &addressBytes[13], &addressBytes[14], &addressBytes[15]);

int i;
for (i = 0; i < 16; i++){
    printf("addressBytes[%d] = %02x\n", i, addressBytes[i]);
}

Output:

addressBytes[0] = fe
addressBytes[1] = 80
addressBytes[2] = 00
addressBytes[3] = 00
addressBytes[4] = 00
addressBytes[5] = 00
addressBytes[6] = 00
addressBytes[7] = 00
addressBytes[8] = 02
addressBytes[9] = 0c
addressBytes[10] = 29
addressBytes[11] = ff
addressBytes[12] = fe
addressBytes[13] = 01
addressBytes[14] = ba
addressBytes[15] = fb
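
When the length is only known at run time, the same hh conversion can be applied in a loop, one byte per call. This is a sketch under stated assumptions: decode_hex_pairs is a made-up name, the input is expected to be an unseparated string of hex digit pairs, and a C99 library is assumed for the hh modifier.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decodes an unseparated hex string such as "fe80"
 * into out (room for cap bytes). Returns the number of bytes written, or
 * -1 on odd length, a failed conversion, or a full buffer. */
int decode_hex_pairs(const char *s, unsigned char *out, int cap)
{
    size_t len = strlen(s);
    size_t i;
    int n = 0;

    if (len % 2 != 0)
        return -1;                   /* an incomplete final byte */

    for (i = 0; i < len; i += 2) {
        if (n == cap)
            return -1;               /* output buffer is full */
        /* %2hhx reads at most two hex digits into an unsigned char. */
        if (sscanf(s + i, "%2hhx", &out[n]) != 1)
            return -1;
        n++;
    }
    return n;
}
```

This is not a strict validator: sscanf will accept a single hex digit followed by some other character, so untrusted input is better checked separately first.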
Diego Medaglia
0

For a pure C implementation I think you can persuade sscanf(3) to do what you want. I believe this should be portable so long as your input string is only ever going to contain two-character hex values separated by single spaces.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char hex[] = "E8 48 D8 FF FF 8B 0D";
char *p;
int cnt = (strlen(hex) + 1) / 3; // Whether or not there's a trailing space
unsigned char *result = (unsigned char *)malloc(cnt), *r;
unsigned int c; // %X stores an unsigned int, so c must be one
int i;

// Bound the loop by cnt so p is never dereferenced past the terminating zero
for (i = 0, p = hex, r = result; i < cnt; i++, p += 3) {
    if (sscanf(p, "%2X", &c) != 1) {
        break; // Didn't parse as expected
    }
    *r++ = (unsigned char)c;
}
bjg
  • Declare `c` as `unsigned int`, otherwise you could overwrite other local variables (or worse yet, your return address). – Ben Voigt Jul 11 '10 at 00:26
  • But generally scanf is going to take longer to figure out the format code than my entire answer will, and the question did ask for an *efficient* way. – Ben Voigt Jul 11 '10 at 00:28
  • @Ben Voigt. Yes, but does efficient mean run-time or programmer-time? '-) Anyway, thanks for pointing out that I should have made `c` an `unsigned int` and coerced that into the `result` array. – bjg Jul 11 '10 at 01:09
  • 1
    UB. Since at expected end `p` points one byte AFTER terminating zero. – Marek R Nov 25 '16 at 14:47
  • @MarekR Good catch. I was clearly in two minds writing this (6 years ago), having declared a `cnt` variable and then having not used it – bjg Nov 27 '16 at 00:18
-1

The old C way, do it by hand ;-) (there are many shorter ways, but I'm not golfing, I'm going for run-time).

enum { NBBYTES = 7 };
char res[NBBYTES+1];
const char * c = "E8 48 D8 FF FF 8B 0D";
const char * p = c;
int i = 0;

for (i = 0; i < NBBYTES; i++){
    switch (*p){
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      res[i] = *p - '0';
    break;
    case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
      res[i] = *p - 'A' + 10;
    break;
   default:
     // parse error, throw exception
     ;
   }
   p++;
   switch (*p){
   case '0': case '1': case '2': case '3': case '4':
   case '5': case '6': case '7': case '8': case '9':
      res[i] = res[i]*16 + *p - '0';
   break;
   case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
      res[i] = res[i]*16 + *p - 'A' + 10;
   break;
   default:
      // parse error, throw exception
      ;
   }
   p++;
   if (*p == 0) { continue; }
   if (*p == ' ') { p++; continue; }
   // parse error, throw exception
}

// let's show the result, C style IO, just cout if you want C++
for (i = 0 ; i < 7; i++){
   printf("%2.2x ", 0xFF & res[i]);
}
printf("\n");

Now another one that allows any number of digits per value and any number of spaces between values, including leading or trailing spaces (Ben's specs):

#include <stdio.h>
#include <stdlib.h>

int main(){
    enum { NBBYTES = 7 };
    char res[NBBYTES];
    const char * c = "E8 48 D8 FF FF 8B 0D";
    const char * p = c;
    int i = -1;

    /* no write here: i is -1, so res[i] would be out of bounds; the first ' ' case zeroes res[0] */
    char ch = ' ';
    while (ch && i < NBBYTES){
       switch (ch){
       case '0': case '1': case '2': case '3': case '4':
       case '5': case '6': case '7': case '8': case '9':
          ch -= '0' + 10 - 'A';
       case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
          ch -= 'A' - 10;
          res[i] = res[i]*16 + ch;
          break;
       case ' ':
         if (*p != ' ' && *p != '\0') { /* a value follows; trailing spaces start nothing */
             if (i == NBBYTES-1){
                 printf("parse error, throw exception\n");
                 exit(-1);
            }
            res[++i] = 0;
         }
         break;
       case 0:
         break;
       default:
         printf("parse error, throw exception\n");
         exit(-1);
       }
       ch = *(p++);
    }
    if (i != NBBYTES-1){
        printf("parse error, throw exception\n");
        exit(-1);
    }

   for (i = 0 ; i < 7; i++){
      printf("%2.2x ", 0xFF & res[i]);
   }
   printf("\n");
}

No, it's not really obfuscated... but well, it looks like it is.
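
The array-based lookup table mentioned in the comments on another answer could be sketched like this. Everything here (hex_value, build_table, parse_with_table, the lazy initialisation) is an assumption for illustration rather than code from this answer. It keeps the strict format of the first version: exactly two digits per value and a single space between values.

```c
/* Digit values indexed by character code; -1 marks a non-hex character.
 * The table is filled on first use (not thread-safe in this sketch). */
static signed char hex_value[256];
static int table_ready = 0;

static void build_table(void)
{
    static const char upper[] = "0123456789ABCDEF";
    static const char lower[] = "0123456789abcdef";
    int i;

    for (i = 0; i < 256; i++)
        hex_value[i] = -1;
    for (i = 0; i < 16; i++) {
        hex_value[(unsigned char)upper[i]] = (signed char)i;
        hex_value[(unsigned char)lower[i]] = (signed char)i;
    }
    table_ready = 1;
}

/* Hypothetical strict parser for "HH HH HH". Returns the number of bytes
 * written to out (room for cap bytes), or -1 on any format error. */
int parse_with_table(const char *s, unsigned char *out, int cap)
{
    int n = 0;

    if (!table_ready)
        build_table();
    if (*s == '\0')
        return 0;

    for (;;) {
        int hi = hex_value[(unsigned char)s[0]];
        /* Only look at s[1] when s[0] was a digit, so we never read past the end. */
        int lo = (hi < 0) ? -1 : hex_value[(unsigned char)s[1]];

        if (hi < 0 || lo < 0 || n == cap)
            return -1;
        out[n++] = (unsigned char)(hi * 16 + lo);
        s += 2;
        if (*s == '\0')
            return n;
        if (*s != ' ')
            return -1;
        s++;   /* step over the single separator */
    }
}
```

Whether this beats the switch depends on the compiler and the target, as noted in the comments; measuring both is the only reliable way to tell.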

kriss