11

As the question says:

typedef __CHAR16_TYPE__ char16_t; 

int main(void)
{
  static char16_t test[] = u"Hello World!\n";

  printf("Length = %d", strlen(test)); // strlen equivalent for char16_t ???

  return 0;
}

I searched and found only C++ solutions.

My compiler is GCC 4.7.

Edit:

To clarify, I was searching for a solution that returns the count of code units, not the count of characters (code points).

These two are quite different for UTF-16 strings containing characters outside the BMP.

Jens Mühlenhoff
  • 1
    Possibly it's worth writing it yourself? – Alex Jan 25 '13 at 19:03
  • 1
    C11 didn't specify such utility functions for the new character types. There are C++ solutions because of C++'s templates. – bames53 Jan 25 '13 at 19:26
  • If you use `-fshort-wchar`, `wcslen(3)` might work. – Carl Norum Jan 25 '13 at 19:38
  • 1
    @Carl: I think that's a bad idea. Compiler options can't change library functions. Even worse it might appear to work when the compiler inlines a builtin version and fail when the lib function is called... – R.. GitHub STOP HELPING ICE Jan 25 '13 at 19:48
  • Well, the compiler driver is often used as a linker front-end, so it could be made to work. It doesn't on my machine here, though. I'd have to agree that it might be a bad idea. – Carl Norum Jan 25 '13 at 19:54
  • I could have written it myself, but I was hoping for a standard library solution which doesn't exist as it seems. – Jens Mühlenhoff Jul 10 '13 at 09:17
  • Also see https://stackoverflow.com/questions/5818508/c11-char16-t-strlen-equivalent-function – sakra Oct 01 '17 at 12:41

5 Answers

12

std::char_traits has this.

#include <string>

std::char_traits<char16_t>::length(yourchar16pointerhere);
Raven
7

Here's your basic strlen:

int strlen16(const char16_t* strarg)
{
   int count = 0;
   if(!strarg)
     return -1; //strarg is NULL pointer
   const char16_t* str = strarg; // keep const: the string is only read
   while(*str)
   {
      count++;
      str++;
   }
   return count;
}

Here's a more compact version that drops the separate counter:

int strlen16(const char16_t* strarg)
{
   if(!strarg)
     return -1; //strarg is NULL pointer
   const char16_t* str = strarg; // keep const: the string is only read
   for(;*str;++str)
     ; // empty body
   return (int)(str-strarg); // pointer difference = number of code units
}

Hope this helps.

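Warning: These functions count 16-bit code units, not characters (code points), so the result is wrong as a character count for any UTF-16 string that contains surrogate pairs. This matters whenever __STDC_UTF_16__ is defined to 1, i.e. when char16_t strings are UTF-16 encoded.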

UTF-16 is variable length (2 bytes per character in the BMP or 4 bytes per character outside the BMP) and that is not covered by these functions.
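If what you want is the number of characters (code points) rather than code units, one way is to count a surrogate pair as a single character. The following is only a sketch under assumptions: the name `utf16_codepoints` is made up for illustration, the input is assumed to be NUL-terminated UTF-16, and an unpaired surrogate is simply counted as one character instead of being reported as an error.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-16 string.
   A high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)
   is one code point; anything else, including a lone surrogate, counts as one. */
size_t utf16_codepoints(const char16_t *s)
{
    size_t count = 0;

    while (*s) {
        /* s[1] is always readable here: *s is non-zero, so at worst
           s[1] is the terminator. */
        if (*s >= 0xD800 && *s <= 0xDBFF && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;   /* surrogate pair: two code units, one code point */
        else
            s += 1;   /* BMP character (or unpaired surrogate) */
        count++;
    }
    return count;
}
```

For the code units `{0x0061, 0xD83D, 0xDE00, 0x0062, 0}` (an "a", one character outside the BMP, and a "b") this returns 3, whereas the functions above return 4.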

Jens Mühlenhoff
askmish
  • I think you mean `while(*str)`. – aschepler Jan 25 '13 at 19:14
  • Oops, typo. Thanks for pointing it out. :) – askmish Jan 25 '13 at 19:18
  • 1
    There's no point to maintain a separate count. At the end you can simply return `str - strarg`. – bames53 Jan 25 '13 at 19:31
  • yes, I thought that I would write that, but then, it would be the exact implementation source, instead of a basic one. :) Still, I am adding it, now that you've mentioned. – askmish Jan 25 '13 at 19:34
  • 3
    Null check is not needed or useful. – R.. GitHub STOP HELPING ICE Jan 25 '13 at 19:50
  • WRONG WRONG!!! WRONG!! TOTALLY WRONG!!! How does this have ANY up votes only god knows. This function fails HORRIBLY on EVERYTHING besides the basic multilingual plane! This does NOT correctly calculate the length of a UTF-16 encoded string!! UTF-16 is variable length, this is a total disservice to anyone searching for a correct implementation. (Assumed `__STDC_UTF_16__`, if this is not the case, this answer should make a VERY bold and noticeable note about that) – Wiz Apr 05 '13 at 00:53
  • 2
    @Wiz That depends on what you're expecting of a strlen for a unicode string. I accepted this answer, because it does what I was looking for. You are right that one should be aware of surrogates though. – Jens Mühlenhoff Jul 10 '13 at 09:14
  • Getting the character length of a UTF-16 string would of course require either converting to UCS-4 or counting surrogate pairs as one character. – Jens Mühlenhoff Jul 10 '13 at 09:21
  • I've added some clarification about characters outside the BMP to the question. – Jens Mühlenhoff Jul 10 '13 at 09:51
  • @JensMühlenhoff The problem is that this code makes its way into other programs with the expectation that char16_t (assuming __STDC_UTF_16__ is 1, of course) is a UTF-16 string, not some small subset of the Unicode within the basic multilingual plane. Nothing in the answer even mentions the limitation (I.E. a total flaw) of this implementation. Even a small note about it would be appropriate. – Wiz Jul 10 '13 at 17:55
  • @JensMühlenhoff Appreciate your support. :) – askmish Sep 24 '13 at 05:43
  • @Wiz Appreciate your thoughts. – askmish Sep 24 '13 at 05:43
  • 1
    `strlen()` returns `size_t`. Why use `int` here? – chux - Reinstate Monica Mar 31 '15 at 20:16
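Taking the comments above together (no separate counter, no NULL check, `size_t` instead of `int`), a version closer to the contract of `strlen()` might look like the sketch below. The name `char16_len` is made up for illustration; it avoids the `str` prefix, which the C standard reserves for future library functions. Like the functions above, it counts code units, not characters.

```c
#include <stddef.h>
#include <uchar.h>

/* Sketch only: number of 16-bit code units before the terminator.
   As with strlen(), passing NULL is undefined behaviour. */
size_t char16_len(const char16_t *str)
{
    const char16_t *end = str;

    while (*end)      /* a whole char16_t is compared against zero */
        ++end;
    return (size_t)(end - str);
}
```

Since the result is a `size_t`, it should be printed with `%zu` rather than `%d`, for example `printf("Length = %zu\n", char16_len(test));`.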
3
#include <string.h>
#include <wchar.h>
#include <uchar.h>

#define char8_t char
#define strlen8 strlen
#define strlen16 strlen16
#define strlen32(s) wcslen((const wchar_t*)(s)) /* only valid where wchar_t is 32 bits wide */

static inline size_t strlen16(register const char16_t * string) {
    if (!string) return 0;
    register size_t len = 0;
    /* Test before incrementing: `while(string[len++]);` would also count the terminator. */
    while (string[len]) len++;
    return len;
}

The function returns the number of char16_t code units, not the number of bytes.
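To make the difference between the two counts concrete: when the string is an array, as in the question, and not just a pointer, both numbers are available at compile time through `sizeof`. This is only a sketch; `ARRAY_LEN`, `test_units` and `test_bytes` are names made up for illustration, and the macro gives wrong results if it is handed a pointer.

```c
#include <stddef.h>
#include <uchar.h>

/* Element count of an array (not a pointer), terminator included. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static const char16_t test[] = u"Hello World!\n";

/* Code units before the terminator, known at compile time. */
static const size_t test_units = ARRAY_LEN(test) - 1;

/* Bytes occupied by those code units. */
static const size_t test_bytes = (ARRAY_LEN(test) - 1) * sizeof(char16_t);
```

Here `test_units` is 13, and `test_bytes` is 13 times `sizeof(char16_t)`, which is 26 on platforms where `char16_t` is exactly 16 bits wide.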

Optimized 32-bit Intel Atom assembly (compiled from the `while(string[len++]);` form of the loop):

gcc -Wpedantic -std=iso9899:2011 -g3 -O2 -MMD -faggressive-loop-optimizations -fkeep-inline-functions -march=atom -mtune=atom -fomit-frame-pointer -mssse3 -mieee-fp -mfpmath=sse -fexcess-precision=fast -mpush-args -mhard-float -fPIC ...

.Ltext0:
    .p2align 4,,15
    .type   strlen16, @function
strlen16:
.LFB20:
    .cfi_startproc
.LVL0:
    mov edx, DWORD PTR 4[esp]
    xor eax, eax
    test    edx, edx
    je  .L4
    .p2align 4,,15
.L3:
.LVL1:
    lea eax, 1[eax]
.LVL2:
    cmp WORD PTR -2[edx+eax*2], 0
    jne .L3
    ret
.LVL3:
    .p2align 4,,7
    .p2align 3
.L4:
    ret
    .cfi_endproc
.LFE20:
    .size   strlen16, .-strlen16

Here is an Intel-syntax disassembly:

static inline size_t strlen16(register const char16_t * string) {
   0:   8b 54 24 04             mov    edx,DWORD PTR [esp+0x4]
    if (!string) return 0;
   4:   31 c0                   xor    eax,eax
   6:   85 d2                   test   edx,edx
   8:   74 16                   je     20 <strlen16+0x20>
   a:   8d b6 00 00 00 00       lea    esi,[esi+0x0]
    register size_t len = 0;
    while(string[len++]);
  10:   8d 40 01                lea    eax,[eax+0x1]
  13:   66 83 7c 42 fe 00       cmp    WORD PTR [edx+eax*2-0x2],0x0
  19:   75 f5                   jne    10 <strlen16+0x10>
  1b:   c3                      ret    
  1c:   8d 74 26 00             lea    esi,[esi+eiz*1+0x0]
    return len;
}
  20:   c3                      ret    
  21:   eb 0d                   jmp    30 <AnonymousFunction0>
  23:   90                      nop
  24:   90                      nop
  25:   90                      nop
  26:   90                      nop
  27:   90                      nop
  28:   90                      nop
  29:   90                      nop
  2a:   90                      nop
  2b:   90                      nop
  2c:   90                      nop
  2d:   90                      nop
  2e:   90                      nop
  2f:   90                      nop
0

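You need to read both bytes of each code unit and check that both are zero, because in UTF-16 a single zero byte can be part of an ordinary character (for example, one of the two bytes of every ASCII character is zero).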

Not a perfect solution (actually a kind of weird solution):

size_t strlen16(const char16_t* str16) {
    size_t result = 0;
    const char* strptr = (const char*) str16; /* keep const: the string is only read */
    char byte0, byte1;

    if(str16 == NULL) return result;

    byte0 = *strptr;
    byte1 = *(strptr + 1);

    while(byte0|byte1) {
        strptr += 2;
        byte0 = *strptr;
        byte1 = *(strptr + 1);
        result++;
    }
    return result;
}
Alex
  • 2
    You don't need to explicitly check each byte, you can simply check if an entire `char16_t` is equal to 0; `x == 0` or `x == u'\0'`. Or if the expression is in a context that gets converted to bool you can rely on the fact that `u'\0'` is converted to false; `while(*str16) str16++;` etc. – bames53 Jan 25 '13 at 19:33
0

On Windows, there is wcslen().

Regardless of the platform, better not use char16_t. I believe it is a blunder on the part of the standard committee to have it in the language.

Pavel Radzivilovsky
  • 5
    `wcslen()` works with `wchar_t`, not `char16_t`. And how is having a standard type for representing UTF-16 code units a blunder? – bames53 Jan 25 '13 at 21:33
  • First, on Windows (which is where the function is) it is practically the same. You can cast one pointer to the other. Second, the word 'blunder' above is a hyperlink. Please be welcome to follow the link :) – Pavel Radzivilovsky Jan 25 '13 at 22:39
  • Even on Windows casting between those types is a violation of strict aliasing rules. "Blunder" links to the _UTF-8 Everywhere_ page, and I certainly agree that using UTF-8 everywhere is best, but it doesn't argue that there should be no standard way to represent UTF-16 code units. – bames53 Jan 25 '13 at 22:56
  • I don't agree that using UTF-8 everywhere is "best", because what's best is not up to the compiler, library or C standard, but to the specific use case. I really like the idea of the new char16_t type; sadly, the library support is not on par with char or wchar_t. – Jens Mühlenhoff Jan 30 '13 at 11:36