
I just recalled that in some places in my code I may have passed unsigned char* variables as arguments to functions such as strcpy and strtok, which expect char*. My question is: is this a bad idea? Could it have caused issues?

e.g.

unsigned char * x = // .... some val, null terminated
unsigned char * y = // ... same here;
strcpy(x, y); // assuming enough space is allocated for x

e.g., unsigned char * x = strtok(NULL,...)

  • It's pretty unclear what you're asking. You should demonstrate a minimal case that fails to compile, or otherwise shows what's going wrong. – πάντα ῥεῖ Apr 03 '14 at 20:35
  • In general I am interested if it is a good idea to pass unsigned char * to a function which expects char *? (I don't get the down votes) –  Apr 03 '14 at 20:35
  • Where'd the `c++` tag go? Not interested in the answer for C++, or assuming (wrongly) that it must necessarily be the same as for C? – Ben Voigt Apr 03 '14 at 23:27
  • @BenVoigt: Yes I am more interested for C answer - and thought for C++ it would be similar –  Apr 04 '14 at 06:25
  • Well, C++ is mostly compatible with C in this area, but the rules that get you there have some significant differences. – Ben Voigt Apr 04 '14 at 06:36
  • @BenVoigt: OK, so basically my question now is: which buffer type should I use to hold UTF-8, `unsigned char *` or `char *`? And if I use `unsigned char *`, which string functions will not break? (I will scroll through the answers to see if there is an answer to this question) –  Apr 04 '14 at 06:52

3 Answers


It's guaranteed to be ok (after you cast the pointer), because the "Strict Aliasing Rule" has a special exception for looking at the same object via both signed and unsigned variants.

See here for the rule itself. Other answers on that page explain it.
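
A minimal sketch of the cast mentioned above, assuming the buffers are declared as unsigned char as in the question. The helper name `copy_ustr` and its bounds check are illustrative assumptions, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a null-terminated unsigned char string.
 * The casts make the unsigned char* -> char* conversion explicit, so the
 * call is accepted even when pointer-sign diagnostics are enabled. */
static unsigned char *copy_ustr(unsigned char *dst, size_t dst_size,
                                const unsigned char *src)
{
    assert(dst != NULL && src != NULL);

    /* strcpy does no bounds checking, so refuse a copy that would overflow. */
    if (strlen((const char *)src) >= dst_size)
        return NULL;

    strcpy((char *)dst, (const char *)src);
    return dst;
}
```

A call might look like `copy_ustr(x, sizeof x, y)` when `x` is an array; the function returns NULL instead of overflowing when the destination is too small.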

Ben Voigt
  • I didn't do casts though. But I think it still worked ... probably because even though unsigned char *, I know mostly it contained ASCII characters. Just I like using unsigned chars - also the SDK I was working with frequently required unsigned chars as parameters.... –  Apr 03 '14 at 20:49
  • @dmcr_code: Well, you might have a compiler where `char` is `unsigned char`, not `signed char`. The Standard doesn't say whether unqualified `char` is signed or not. In general, such code is portable only if you cast the pointer. – Ben Voigt Apr 03 '14 at 20:50
  • I got a bit confused because in the future I might want to use UTF-8 with unsigned char as the buffer, so I wondered what could happen. But then clearly some plain string functions will not be useful anyway. strtok should be OK, though, since the delimiter is an ASCII character –  Apr 03 '14 at 20:51
  • @dmcr_code: UTF-8 works just fine with *most* functions designed for single-byte strings. – Ben Voigt Apr 03 '14 at 20:52
  • Well yes, but my suspicions came from the fact that I realized I might have used unsigned char*'s all over the place, even though the buffers themselves should contain ASCII values. So this should not be a problem, I guess? Also, like I said, I probably didn't cast but didn't get any issues. Do you recommend that I do anything about this? –  Apr 03 '14 at 20:54
  • If I move to UTF-8 later, it won't fit in char * anyway, right? Some values might be larger than 127? Then how would I use strtok? –  Apr 03 '14 at 20:55
  • @dmcr_code: `unsigned char` can hold ASCII values just as well as `char` or `signed char`. And all these string functions will work correctly on `unsigned char` buffers even though they expect `char*`. You might have weird behavior in sorting, though, if you have non-ASCII data. – Ben Voigt Apr 03 '14 at 20:55
  • UTF-8 fits in `signed char` just fine, but lead bytes are interpreted as negative. – Ben Voigt Apr 03 '14 at 20:56
  • `strcpy` and all other C string / char handling functions are defined to work as if on `unsigned char`. So, no problem. – Deduplicator Apr 03 '14 at 20:56
  • @Deduplicator: Do you have a source for that? I expect all string-handling functions to work ok and find the NUL terminator ok, but that non-ASCII characters might compare either before or after ASCII characters. – Ben Voigt Apr 03 '14 at 20:58
  • @Ben Voigt: "And all these string functions will work correctly on unsigned char buffers even though they expect char*" -> This is pretty straightforward if the unsigned buffers contain ASCII, as is the case for me. I got curious about what happens if the values are non-ASCII characters (but still null terminated)? –  Apr 03 '14 at 21:02
  • @BenVoigt: And what should I do in general in my situation? What is the best way to handle these strings? Which type of buffer should I store them in? etc. –  Apr 03 '14 at 21:06
  • @dmcr_code: Use whatever makes sense for your own data processing. If you're doing bitwise manipulation and don't want right-shift to cause sign extension, use `unsigned char`. – Ben Voigt Apr 03 '14 at 21:11
  • @BenVoigt: So far I used unsigned char* but values were ASCII. In future the buffer might contain UTF8 –  Apr 03 '14 at 21:19
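
The points in the comments above (strtok on UTF-8 with an ASCII delimiter, and lead bytes being negative in a signed char) can be sketched as follows. This is only an illustration under stated assumptions: `split_utf8` and `is_utf8_lead` are hypothetical names, and the input is assumed to be valid, null-terminated UTF-8. It works because no byte of a multi-byte UTF-8 sequence is in the ASCII range, so an ASCII delimiter can never match in the middle of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split a UTF-8 buffer in place on ASCII delimiters.
 * Stores up to max_tokens token pointers and returns how many were stored.
 * Note that strtok overwrites the delimiters in buf with '\0'. */
static size_t split_utf8(unsigned char *buf, const char *delims,
                         unsigned char **tokens, size_t max_tokens)
{
    size_t n = 0;

    assert(buf != NULL && delims != NULL && tokens != NULL);

    for (char *tok = strtok((char *)buf, delims);
         tok != NULL && n < max_tokens;
         tok = strtok(NULL, delims)) {
        tokens[n++] = (unsigned char *)tok;
    }
    return n;
}

/* Hypothetical helper: nonzero if b starts a multi-byte UTF-8 sequence.
 * Taking the byte as unsigned char keeps the range test simple; with a
 * signed char the same byte would compare as a negative value. */
static int is_utf8_lead(unsigned char b)
{
    return b >= 0xC2 && b <= 0xF4;
}
```

Keeping the buffer as unsigned char and casting only at the strtok call leaves byte-level tests such as `is_utf8_lead` free of sign surprises.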

The C aliasing rules have exceptions for signed/unsigned variants and for char access in general. So no trouble here.
Quote from the standard:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.

The string functions in the standard library treat every character as unsigned char, so passing char*, unsigned char* or signed char* is treated the same.
Quote from the intro of <string.h>:

For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
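
One visible consequence of that sentence is that strcmp orders bytes by their unsigned char value, so a byte above 0x7F sorts after every ASCII character even where plain char is signed. A minimal sketch, with `sorts_before` as a hypothetical helper name:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if string a sorts before string b.
 * strcmp compares the first differing pair of characters as unsigned char,
 * so the result does not depend on whether plain char is signed. */
static int sorts_before(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp((const char *)a, (const char *)b) < 0;
}
```

By contrast, comparing two plain `char` values directly with `<` can give the opposite order on a platform where char is signed, because the high byte is then negative.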

Still, your compiler should complain if you get the signed-ness wrong, especially if you enable all warnings (you should, always).

Deduplicator
  • And the only thing my Standard says about treatment as `unsigned char` is "The descriptions of many library functions rely on the C standard library for the signatures and semantics of those functions. In all such cases, any use of the restrict qualifier shall be omitted." (The question is tagged `c++`, too). Anyway, that's good to know. – Ben Voigt Apr 03 '14 at 21:09
  • Hm, I already dug out the C references. As C and C++ try to avoid needless incompatibilities, there should be something similar for C++. Though maybe your quote suffices... – Deduplicator Apr 03 '14 at 21:13
  • P.S. This is for C. I thought it would not make a difference; I will remove the C++ tag now –  Apr 03 '14 at 21:18
  • @dmcr: If the answer is the same for both, and it's already answered anyway, why bother? Anyway, I sometimes add a C or C++ tag, if it works for both. Just for better searching. – Deduplicator Apr 03 '14 at 21:23
  • @giorgim: No, `char` `signed char` and `unsigned char` (and pointers to them) are different types, it's just that all standard-library-functions handle `char` as if it was an `unsigned` integer type. – Deduplicator Dec 21 '14 at 16:36

The only problem with converting unsigned char * into char * (or vice versa) is that it's supposed to be an error. Fix it with a cast.

e.g.,

function((char *) buff, len);

That being said, strcpy needs to have the null-terminating character (\0) to properly work. The alternative is to use memcpy.
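
A minimal sketch of the memcpy alternative, assuming the caller tracks the length explicitly. The helper name `copy_bytes` and its return convention are illustrative assumptions. Because memcpy takes `void *` parameters, no cast is needed for unsigned char buffers, and embedded zero bytes are copied like any other byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy exactly len bytes from src to dst.
 * Returns 0 on success, or -1 if dst cannot hold len bytes. */
static int copy_bytes(unsigned char *dst, size_t dst_size,
                      const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    if (len > dst_size)
        return -1;

    memcpy(dst, src, len);  /* void* parameters: no cast required */
    return 0;
}
```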

But you shouldn't use unsigned char arrays with string handling functions. In C strings are char arrays, not unsigned char arrays. Since passing to strcpy discards the unsigned qualifier, the compiler warns.

As a general rule, don't make things unsigned when you don't have to.

Engineer2021
  • @dmcr_code: If this was a C compiler, it shouldn't care. – Engineer2021 Apr 03 '14 at 20:40
  • As a general rule, also don't make things signed when you don't have to. – 4pie0 Apr 03 '14 at 20:41
  • @staticx so, just to answer in short: will strcpy work correctly in this case or not? – 4pie0 Apr 03 '14 at 20:41
  • And maybe also strtok? –  Apr 03 '14 at 20:42
  • There are libraries that use `unsigned char*` as strings. SQLite is an example. Now, why does `char` have signed/unsigned counterparts in the first place? – Joker_vD Apr 03 '14 at 20:51
  • @Joker_vD: Because it's a small integral type, useful for math. Math which acts differently depending on signedness. For example, `'\x80' < '\x00'` is well-defined but unspecified. – Ben Voigt Apr 03 '14 at 20:53
  • `strcpy` and all other C string / char handling functions are defined to work as if on `unsigned char`. So, no problem. – Deduplicator Apr 03 '14 at 20:55
  • "In C strings are char arrays" is not supported by the C spec which says "A string is a contiguous sequence of characters terminated by and including the first null character." If the spec meant to say `char` rather than `characters`, it would have said that. Further §6.2.5 15 "The three types char, signed char, and unsigned char are collectively called the character types." – chux - Reinstate Monica Apr 03 '14 at 21:16
  • @BenVoigt I know about an architecture where `char` was 32 bit wide, so... Anyway, I was just saying that `char` is not a `byte`, it's more like `void*`: `'a'+'b'` is weird, `'b'-'a'` is `1` and reasonable, for example. – Joker_vD Apr 03 '14 at 23:09
  • @chux: I don't know about C, but in C++, narrow string literals have type "array of *n* `const char`". Specifically `char`, and not an indeterminate character type. See 2.14.5p8 – Ben Voigt Apr 03 '14 at 23:26
  • @Ben Voigt Interesting. Came across C++ 21.4 1 "The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero. Such a sequence is also called a “string” if the type of the char-like objects that it holds is clear from context." Also C string may match C++ 17.5.2.1.4.1 Byte strings "A null-terminated byte string, or ntbs, is a character sequence whose highest-addressed element with defined content has the value zero (the terminating null character); ..." – chux - Reinstate Monica Apr 04 '14 at 01:46
  • @Chux: Yes, those definitions allow algorithms designed for strings to be used on all kinds of other data types. And you can even make a user-defined string literal of any data type you want, because by happy accident things like "find the first element equal to zero (strlen)" work just as well for e.g. `float` as for `char`. Still, the simple string literal `"abcd"` gives you `char`s. And `std::string` always means `std::basic_string<char>`. The bit there with "called a string if the type is clear from context" means that, say, "string of `unsigned char`" is perfectly valid. – Ben Voigt Apr 04 '14 at 02:17
  • @Ben Voigt The `float-string` example has a curious nuance in that there are in most `float` representations 2 distinct zeros: +0.0 and -0.0 which `==` but not `memcmp()`. I suppose that is the same with 0 in old-school `char` using 1's or sign-magnitude. – chux - Reinstate Monica Apr 04 '14 at 02:25
  • @chux: What was your point with the first comment? i.e., this one: "In C strings are char arrays" is not supported by the C spec which says "A string is a contiguous sequence of characters terminated by and including the first null character." If the spec meant to say char rather than characters, it would have said that ...." –  Apr 04 '14 at 07:35
  • @dmcr_code My first comment addressed the answer posted here by @staticx, who asserted "In C strings are char arrays, not unsigned char arrays". My reading of the C spec came to a different conclusion: C spec §7.1.1 1 "A string is a contiguous sequence of characters..." and §6.2.5 15 "The three types char, signed char, and unsigned char are collectively called the character types." Together, I take this to mean a C string can be composed of any of the 3 types: `char`, `unsigned char`, `signed char`. – chux - Reinstate Monica Apr 04 '14 at 14:10
  • @chux: Ah, OK, I see now. I am slowly coming to the conclusion that I can use `unsigned char*` in (string) functions where `char*` is needed, albeit with a cast. –  Apr 04 '14 at 14:44