
I am using the pcre2 library, and it has a special 'string-type' defined as PCRE2_SPTR8.

If I try and initialize a string with something like:

PCRE2_SPTR8 s = "my string";

I'll get a warning such as:

warning: initializing 'PCRE2_SPTR8' (aka 'const unsigned char *') 
         with an expression of type 'char [10]' converts between pointers to
         integer types with different sign [-Wpointer-sign]

What would be the suggested way to initialize this type of string? Is a cast like the following the right approach?

PCRE2_SPTR8 s = (PCRE2_SPTR8) "my string";

Additionally, out of curiosity, why is a normal "string" usually defined as char* string = "something"; instead of unsigned char* string = "something";? Is there any advantage or disadvantage to defining a string with a signed vs. unsigned char?

carl.hiass
  • `char` is weird in that compilers can treat it as either default signed, or default unsigned because *reasons*. As such, your character constants will inherit this and forcing unsigned might break things. – tadman Apr 11 '21 at 01:22
  • @tadman what are a few of the reasons you can think of? – carl.hiass Apr 11 '21 at 01:38
  • Differences of opinion now fossilized into the C standard. – tadman Apr 11 '21 at 01:41
  • Because the C standard leaves whether `char` is signed or unsigned up to the implementation. See [Is char signed or unsigned by default?](https://stackoverflow.com/q/2054939/3422102) (just another way they have found to torment C programmers `:)` – David C. Rankin Apr 11 '21 at 01:41
  • @carl.hiass In C, `int, long, short` without `signed` are _signed_. Extending that idea to `char` makes it _signed_ too, which seems natural. Yet a signed `char` is very problematic on non-2's complement machines, as there are then two encodings of 0 (or worse, a trap). Such machines benefited from an unsigned `char`. Also, [EBCDIC](https://en.wikipedia.org/wiki/EBCDIC) uses a 0-255 character encoding, unlike ASCII's 0-127, and benefits from an unsigned `char`. The C compromise: `char` is a type distinct from `signed char` and `unsigned char` that matches the range/size of one of those. – chux - Reinstate Monica Apr 11 '21 at 02:21

1 Answer


What would be the suggested way to initialize this type of string?

A cast like the OP's is fine, since PCRE2_SPTR8 is const unsigned char * rather than char *. It is more common, though, to avoid hiding the * behind a typedef. As this is a style issue, follow your group's style guide.

// PCRE2_SPTR8 s = (PCRE2_SPTR8) "my string";
const unsigned char *s = (const unsigned char *) "my string";
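If the cast appears in many places, one option is to keep it in a single helper. The following is only a sketch: `sptr8_demo` is a stand-in typedef for PCRE2_SPTR8 (so the snippet compiles without <pcre2.h>), and `as_sptr8`, `sptr8_len`, and `demo_pattern` are hypothetical names, not part of PCRE2. It also shows a cast-free alternative: C allows an array of unsigned char to be initialized directly from a string literal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for PCRE2_SPTR8, which the warning in the question
   reports as 'const unsigned char *'. Real code would use <pcre2.h>. */
typedef const unsigned char *sptr8_demo;

/* Hypothetical helper: keeps the char -> unsigned char cast in one place. */
static inline sptr8_demo as_sptr8(const char *s)
{
    assert(s != NULL);
    return (sptr8_demo) s;
}

/* Hypothetical helper: casts back to char * for a standard string function. */
static inline size_t sptr8_len(sptr8_demo s)
{
    assert(s != NULL);
    return strlen((const char *) s);
}

/* Cast-free alternative: an unsigned char array may be initialized
   from a string literal, so no pointer conversion is involved. */
static const unsigned char demo_pattern[] = "my string";
```

Usage would then look like `sptr8_demo s = as_sptr8("my string");`, and `demo_pattern` can be passed wherever a `const unsigned char *` is expected.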

Additionally, out of curiosity why is a normal "string" usually defined as char* string = "something"; instead of unsigned char* string = "something";.

In C, a string is defined by the standard library as:

A string is a contiguous sequence of characters terminated by and including the first null character.

It is best to stay close to that definition. char* string is not a string, but a pointer to a string. Likewise for unsigned char* string.


Is there any advantage/disadvantage of defining a string with a signed vs. unsigned char?

The C library's string functions behave as if the string elements were unsigned char. From the standard:

For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).

This matters in select cases, such as strcmp() comparing two characters where one lies outside the ASCII range: the values are compared as if they were unsigned char. It also matters on the rare platforms today that do not use 2's complement.
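To make the difference concrete, here is a small sketch comparing strcmp() with a comparator that looks at elements as plain char. `naive_char_cmp`, `std_cmp_sign`, and `plain_char_is_signed` are hypothetical names made up for this illustration. With the byte 0xE9 against 'a', strcmp() reports "greater" (233 vs. 97), while the naive version reports "less" on implementations where plain char is signed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparator that compares elements as plain char,
   so its ordering depends on whether char is signed. */
static int naive_char_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Sign (-1, 0, 1) of the standard strcmp(), which compares
   elements as if they were unsigned char. */
static int std_cmp_sign(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}

/* 1 if plain char is signed on this implementation, 0 otherwise. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

For example, `std_cmp_sign("\xE9", "a")` is positive everywhere, whereas the sign of `naive_char_cmp("\xE9", "a")` depends on `plain_char_is_signed()`.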

When implementing a string-like function, it is best to work with unsigned char internally.
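As a rough sketch of that advice, consider a case-insensitive compare. `my_strcasecmp` is a hypothetical helper, not a standard C function. It keeps the familiar char * interface but reads the elements through unsigned char pointers, which gives the same ordering as the library functions and also keeps the argument to tolower() within the range that <ctype.h> functions accept.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the C standard library.
   Takes char * like the standard string functions, but reads
   each element as unsigned char internally. */
static int my_strcasecmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const unsigned char *ua = (const unsigned char *) a;
    const unsigned char *ub = (const unsigned char *) b;

    for (;; ua++, ub++) {
        /* *ua and *ub are in 0..UCHAR_MAX, a valid range for tolower(). */
        int ca = tolower(*ua);
        int cb = tolower(*ub);
        if (ca != cb || ca == 0)
            return ca - cb;  /* 0 only when both strings ended together */
    }
}
```

Callers pass ordinary char strings, for example `my_strcasecmp("Hello", "hELLO")`, so no casts are needed at the call site.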

When calling string functions, it is best to stay with char to minimize the need for casting.

chux - Reinstate Monica