Bit extension occurring when trying to store a "byte" of information into a %x

Question

So I currently need to read in a string of hex ASCII characters and use them to determine a specific opcode. To do so I open a text file (it's not really one, but for a simple explanation we'll call it that) and read in a line. So if I get a line like 40f1..., I have the function read 2 characters at a time and store them as an unsigned int in mem_byte. Then I cast that to a char and store it in an array, so that each element retains one "byte" worth of information: the numerical value of two hex digits, obtained by reading the ASCII representation of those 2 hex digits.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void getInstructions(FILE* read_file, int* stack_point, char** instructions, int size)
{
    unsigned int mem_byte;
    int ins_idx = 0;
    char* ins_set = malloc(sizeof(char) * size);    //Set size of the array of bytes for memory. Temporarily holds the memory of the program

    fscanf(read_file, "%d", stack_point);

    //Reading in the memory from the program file
    fscanf(read_file, " %2x", &mem_byte);       //Initial read to clear whitespace
    ins_set[ins_idx] = (char) mem_byte;
    ins_idx++;

    //Loops and reads 1 byte at a time. Comparing against 1 (not != 0) also stops at end of file, where fscanf returns EOF rather than 0
    while(ins_idx < size && fscanf(read_file, "%2x", &mem_byte) == 1)
    {
        ins_set[ins_idx] = (char) mem_byte;
        printf("Byte: %x\n", ins_set[ins_idx]);     //This is the line that prints fffffff1
        ins_idx++;
    }

    //Copy the bytes back to the caller's buffer. memcpy rather than strcpy: the data is binary, not NUL-terminated, and may contain 0x00 bytes
    memcpy(*instructions, ins_set, ins_idx);
    free(ins_set);
}

So the problem I run into is that if I print out the test results, I get

Byte: 40
Byte: fffffff1

Which means that the code is extending the char into a 4-byte data type. I am not sure whether the char is holding leftover information from the unsigned int and printing it out, or whether I am misunderstanding how %x or type casting works. I would like my char instructions array to hold only 2 hex digits' worth of information and nothing more.

You're on a machine with signed plain `char`. Characters in the range 0x80..0xFF appear as negative numbers when promoted to `int`, as happens in a call to `printf()`. You can cast the value (`(unsigned char)ins_set[ins_idx]`), or mask it (`ins_set[ins_idx] & 0xFF`) or tell `printf()` to treat it as a char type (`%hhx`). The first two work with any version of C; the last only works with C99 or later. — Jonathan Leffler, Aug 13 '17 at 23:30
@JonathanLeffler Using the `&` operator on signed integers that have negative values has implementation-defined results and isn't guaranteed to resolve this issue. — autistic, Aug 13 '17 at 23:34
@Seb: Unlike, say, the shift operators, `&` does not have caveats associated with it about signed vs unsigned integers. The standard (§6.5.10) says: _¶3 The usual arithmetic conversions are performed on the operands. — ¶4 The result of the binary `&` operator is the bitwise AND of the operands (that is, each bit in the result is set if and only if each of the corresponding bits in the converted operands is set)._ No clauses that allow it to have implementation or undefined behaviour for negative signed `int` values. Similarly for `|` and `^` too (and `~`). — Jonathan Leffler, Aug 14 '17 at 00:22
@JonathanLeffler Does `&` specify whether it operates solely upon value bits? Or might it also operate upon (or neglect to operate upon) sign bits and/or padding bits? I notice that [C11/6.2.6.2p3](http://port70.net/~nsz/c/c11/n1570.html#6.2.6.2p3) mentions that negative zeros can be formed using the `&` operator, and that the point after it mentions UB associated with implementations that don't support negative zeros. That states contrary to your advice... I'll trust the standard. — autistic, Aug 14 '17 at 00:40
@Seb: What I quoted is the whole of what the standard says. Nothing about value vs other bits. — Jonathan Leffler, Aug 14 '17 at 00:41
@JonathanLeffler Precisely, thus leaving that to be a/ implementation-defined or b/ undefined by the standard. In addition, you have the "negative zero" issue which "shall be generated only by: the `&`, `|`, `^`, `~`, `<<`, and `>>` operators with operands that produce such a value"... which is rather vague when compared to its surroundings. — autistic, Aug 14 '17 at 00:46
@Seb: I think you're trying to make the standard more complex than it is. The standard says that each pair of bits in the (converted) operands is AND'd to produce the result. What that value means is separate; the result is clearly defined as a bit pattern. If you can identify an actual modern system where there's any conceivable issue, I'll be very surprised — and interested. — Jonathan Leffler, Aug 14 '17 at 00:49
@JonathanLeffler The point is not about issues in the present, as those are often the least troublesome, but also well into the future. StackOverflow acknowledges this in a number of places, and to brush that aside seems disrespectful and neglectful of the community. I'm trying to suggest avoiding possible issues not just now, but well into the future, by giving answers that are strictly well-defined, where possible. — autistic, Aug 14 '17 at 01:00
@Seb: Newly designed modern systems would not be designed as weirdly as you seem to suppose. Those rules are backwards compatibility issues for now antique systems that are now out of service, or residing in backwaters where mainframes still have a use. It is incredibly unlikely that a new commercially viable CPU would be designed that didn't use 2's complement arithmetic on 'powers of 2' type sizes. 36-bit, 60-bit, etc systems are a thing of the (dim distant) past; those are what the exceptions are for. … Anyway, the decision is yours to make — and those who bother to read these comments. — Jonathan Leffler, Aug 14 '17 at 01:03
@JonathanLeffler 1/ It must thus be inappropriate for future visitors to ask about hardware of the past, 2/ as identified in standard-ese questions I've asked in the past, not all UB/implementation-defined behaviour has *physical machines* in mind as the rationale, 3/ if you can so accurately predict the future, please tell me the winning lottery ticket numbers for next week. — autistic, Aug 14 '17 at 01:08

autistic · Answer 1 · 2017-08-13T23:42:09.883

2

Arguments of type char, short, etc get implicitly converted to int when they're passed to variadic functions such as printf.

Thus, a negative value of one of those types will be sign-extended so that it holds the same value of type int; -1 as a char (which is commonly 0xFF as an unsigned char) will be implicitly converted to -1 as an int (which it seems would hold an underlying representation of 0xFFFFFFFF on your system).

Consider casting your argument to unsigned char to mitigate the sign extension you've noticed.

e.g. printf("Byte: %x\n", (unsigned char) ins_set[ins_idx]);

edited Aug 13 '17 at 23:42

answered Aug 13 '17 at 23:25

autistic


I now understand that casting it to an unsigned int would give me the correct printed results but does the physical memory at ins_set[ins_idx] store f1 or would it still store fffffff1? – M. Youn Aug 13 '17 at 23:45
I feel like I'm repeating myself... because I am... "Arguments of type `char`, `short`, etc get implicitly converted to `int` when they're passed to variadic functions such as `printf`." I can't state with certainty what `ins_set[ins_idx]` stores, because: 1/ you haven't given an MCVE (tsssk tsssk! naughty!) and there's inconsistency between your output and your code. 2/ it's possible that you could've changed `ins_set` to store `unsigned int` or `unsigned long` values. 3/ there's the possibility that `CHAR_BIT` could be 32. Since you don't believe my first paragraph, I'll cite: – autistic Aug 14 '17 at 00:26
[7.21.6.1p8](http://port70.net/~nsz/c/c11/n1570.html#7.21.6.1p8) states that the `X` format specifier corresponds to an `unsigned int` argument. – autistic Aug 14 '17 at 00:26
[6.5.2.2p6](http://port70.net/~nsz/c/c11/n1570.html#6.5.2.2p6) covers the conversions generally: *"If the expression that denotes the called function has a type that does include a prototype, the arguments are implicitly converted, as if by assignment, to the types of the corresponding parameters, taking the type of each parameter to be the unqualified version of its declared type. The ellipsis notation in a function prototype declarator causes argument type conversion to stop after the last declared parameter. The default argument promotions are performed on trailing arguments."* – autistic Aug 14 '17 at 00:27
[6.3.1.1p2](http://port70.net/~nsz/c/c11/n1570.html#6.3.1.1p2) covers integer promotions relevant here: *"If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions."* – autistic Aug 14 '17 at 00:29
Providing you've not lied about the type of `ins_set` and `CHAR_BIT` is 8, then the internal representation of `ins_set[ins_idx]` is much more likely to resemble that of `0xF1` as an `unsigned char`. **The cause of what you're seeing is (most likely) the conversion I've described above.** – autistic Aug 14 '17 at 00:33
Does C really allow for things bigger than a char, like an unsigned int/long, to fit in what was declared a char? I went back and changed everything back to unsigned char and I feel like it does what I want it to. Now I'm just curious about how this problem arose. – M. Youn Aug 14 '17 at 00:42
No. You need to go back and read what I've written. What you're seeing is the result of a conversion. If you convert a `char` value to an `int` value, the value gets widened (and sign-extended, if `char` is a signed type). That's what's happening, except you're not explicitly asking for the conversion to take place; it's an ***implicit*** conversion. – autistic Aug 14 '17 at 00:50
... and for the third time, that conversion is occurring *just before* the call to `printf`, the result of which gets stored into *the argument to `printf`* (an `unsigned int`, not a `char`)... – autistic Aug 14 '17 at 00:52
Out of curiosity, suppose you explicitly declare a variable for your argument which has `unsigned int` type... say: `unsigned int argument = ins_set[ins_idx];`... and then you print that: `printf("%02X\n", argument);`... This makes the conversion more explicit. What do you think the output will be? Test that theory, and in testing it, observe the value of `argument` (using the above `printf`), `ins_set[ins_idx]` (using `printf("%d\n", ins_set[ins_idx]);`)... Allow yourself to become familiar with the value *before* and *after* the conversion... – autistic Aug 14 '17 at 01:05
