
I have two questions, one of them small, so I will ask them together. Is implementation-defined behaviour as dangerous as undefined behaviour?

I read a Unicode string from a file using this code:

    char buf[1000];
    while (fgets(buf, sizeof buf, ptr_file) != NULL)
        printf("line: %s", buf);

I believe the Unicode characters in the file were saved in UTF-8 encoding. Each UTF-8 byte value was greater than 127 when I checked. Nevertheless, the array is of char type, as you can see (typically range -128 to 127). But the string was printed correctly. What happened? Did I invoke UB?

  • char can be signed or unsigned. – Shafik Yaghmour Jul 15 '14 at 20:14
  • @ShafikYaghmour: I didn't get your point, but I think for my second question it is impl. defined behaviour (I read that somewhere) –  Jul 15 '14 at 20:26
  • The usual conversion of unsigned char to signed char is to subtract 256 from every value greater than 127. This preserves the bit pattern for two's-complement machines. As long as the bit pattern remains the same, you'll get the expected behavior. "Implementation defined" usually just reserves the right for a weird architecture to do things differently; you're very unlikely to run into weird architectures. – Mark Ransom Jul 15 '14 at 21:18
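
A minimal sketch of the conversion described in that comment, assuming a two's-complement machine where plain char is signed: the value changes, but the bit pattern, and hence the byte, does not:

    #include <stdio.h>

    int main(void)
    {
        unsigned char u = 202;    /* bit pattern 11001010 (0xCA) */
        char c = (char)u;         /* implementation-defined result if char is signed;
                                     202 - 256 = -54 on a two's-complement machine */

        /* The bit pattern is unchanged, so converting back recovers the byte. */
        printf("value: %d, byte: %02X\n", c, (unsigned)(unsigned char)c);
        return 0;
    }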

2 Answers


When the standard states that something has implementation-defined behavior, it means each implementation must document what will happen in that case. The behavior is not undefined, but it may differ among implementations.

The signedness of char is one such example. It is implementation-defined whether it is signed or unsigned, but the implementation must document its choice (and usually, it will provide a switch to let you choose which way you want it).
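
For example, a quick way to see which choice your implementation made is to inspect CHAR_MIN from <limits.h> (a minimal sketch; the switches mentioned above are, for instance, -fsigned-char and -funsigned-char on GCC and Clang):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* CHAR_MIN is 0 if plain char is unsigned, negative if it is signed */
        printf("char is %s (range %d..%d)\n",
               CHAR_MIN < 0 ? "signed" : "unsigned", CHAR_MIN, CHAR_MAX);
        return 0;
    }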

Note that char is itself a type that is distinct from signed char and unsigned char (as opposed to int which is synonymous with signed int).
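
A small C11 sketch that makes this distinctness visible (illustrative only): _Generic selects on the declared type, and all three associations below are legal precisely because they are three different types:

    #include <stdio.h>

    #define TYPE_NAME(x) _Generic((x),        \
        char:          "char",                \
        signed char:   "signed char",         \
        unsigned char: "unsigned char")

    int main(void)
    {
        char c = 0;
        signed char sc = 0;
        unsigned char uc = 0;

        /* Three distinct associations are allowed because these are
           three distinct types, whatever the signedness of plain char. */
        printf("%s / %s / %s\n", TYPE_NAME(c), TYPE_NAME(sc), TYPE_NAME(uc));
        return 0;
    }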

Cross references: C11 §6.2.5 ¶15 and C++11 §3.9.1 ¶1.

jxh
  • so can we say impl defined behaviour is less dangerous than UB? –  Jul 15 '14 at 20:32
  • Yes, because the behavior is deterministic. But, the code that depends on implementation defined behavior is not maximally portable. – jxh Jul 15 '14 at 20:34
  • You read a byte into a char, and then you write a byte out as a char. What are you confused about? – jxh Jul 15 '14 at 20:36
  • the byte is Unicode, as I wrote, and has a value > 127. My buffer is of char type. char has range -128 to 127 –  Jul 15 '14 at 20:37
  • That is what your implementation is choosing to represent your `char` as, but the value has a binary representation that translates to the correct unicode character. If you really care about seeing the "integral" values in your buffer, print them as unsigned, or store them in `uint8_t`. – jxh Jul 15 '14 at 20:39
  • the concern is that char range is -128 to 127, but I write to it a value of, say, 202 .. –  Jul 15 '14 at 20:40
  • The eight-bit binary representation of 202 is 11001010, and this is what is stored in your char. This also represents the value -54 with a signed 8-bit `char`, but the binary representation is what gets interpreted when it is printed. – jxh Jul 15 '14 at 20:43
  • please look it up; in some places it is said this can be integer overflow and either impl. defined behaviour or undefined behaviour –  Jul 15 '14 at 20:45
  • Signed integer overflow only happens when you do arithmetic. And you are not doing arithmetic. – jxh Jul 15 '14 at 20:47
  • @jxh: It's not that straightforward: http://stackoverflow.com/questions/18922601/is-char-foo-255-undefined-behavior-if-char-is-signed, you see? just I had bit different situation that is why I asked –  Jul 15 '14 at 20:51
  • If you feel that my answer is not helpful to you, I am happy to delete it, but I feel your question is unclear. You have shifted from why your `char` shows negative values to how values got stored into your buffer in the first place. Those are totally different questions. In any case, reading values from a file is different from assigning a literal value to a variable, so the issue you note does not apply. Note that the assignment of an out of range literal value requires "arithmetic" to get a sensible value assigned. – jxh Jul 15 '14 at 23:06
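
A minimal sketch of the distinction drawn in that last comment, assuming plain char is signed on a two's-complement machine: assigning an out-of-range literal forces an implementation-defined conversion, whereas fgets merely copies bytes:

    #include <stdio.h>

    int main(void)
    {
        /* The int value 202 cannot be represented in a signed char, so this
           assignment performs an implementation-defined conversion (C99
           6.3.1.3p3); on two's-complement machines the result is typically -54.
           No such conversion happens when fgets copies bytes into a buffer. */
        char c = 202;
        printf("%d\n", c);
        return 0;
    }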

To answer the second question: I think there is no UB for any code point represented in UTF-8 encoding, referring to the latest C99 draft, §6.2.5 ¶3:

An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

It might be useful to add that the fgets function has this prototype:

char *fgets(char * restrict s, int n, FILE * restrict stream);

For example, the letter ś is encoded in UTF-8 as two bytes: C5 (197 in decimal, so it is outside the -128..127 range, assuming the signed variant of char) and 9B. The value that C5 actually produces when stored in a char object is implementation-defined. Since UTF-8 encoding produces a sequence of bytes, there is no practical issue with storing any of them in a char object.
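
A sketch tying this back to the question's loop (the file name input.txt is hypothetical; it is assumed to contain UTF-8 text): reading with fgets and dumping each byte through unsigned char shows the encoded bytes arrive intact:

    #include <stdio.h>

    int main(void)
    {
        FILE *ptr_file = fopen("input.txt", "r");   /* hypothetical file name */
        char buf[1000];

        if (ptr_file == NULL)
            return 1;
        while (fgets(buf, sizeof buf, ptr_file) != NULL) {
            printf("line: %s", buf);
            /* Dump the raw bytes: cast through unsigned char to see 0..255. */
            for (const char *p = buf; *p != '\0'; p++)
                printf("%02X ", (unsigned)(unsigned char)*p);
            putchar('\n');
        }
        fclose(ptr_file);
        return 0;
    }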


For the first question, see: Undefined, unspecified and implementation-defined behavior.

Grzegorz Szpetkowski