Can someone please explain why the 1st function works but the 2nd doesn't?

unsigned int utf8_count(char* in)
{
    unsigned int i = 0, c = 0;
    while (in[i])
    {
        if ((in[i] & 0xc0) != 0x80)
            c++;

        i++;
    }

    return c;
}

unsigned int utf8_count(char* in, unsigned int in_size)
{
    unsigned int i = 0, c = 0;
    while (i < in_size)
    {
        if ((in[i] & 0xc0) != 0x80)
            c++;

        i++;
    }

    return c;
}

I understand what `(in[i] & 0xc0) != 0x80` does, but I don't understand why looping while `i < in_size` gives a different result from looping while `in[i]` is non-zero.

Example string: ゴールデンタイムラバー/スキマスイッチ (57 bytes, 19 characters).

Why does `utf8_count(in, 57)` return 57 and not 19?

The binary representation of the example string:

[image: binary representation of the example string]

Luka
  • What are you passing in for `in_size`? If you pass in `strlen(in)` the two functions are equivalent. – TypeIA Mar 10 '14 at 20:08
  • I pass the size in bytes. Why does strlen return the size in characters? – Luka Mar 10 '14 at 20:10
  • `strlen` returns the number of non-NULL *bytes* encountered before the trailing NULL. When you say "the size in bytes" what do you mean exactly? Where did the size come from / how is it calculated? – TypeIA Mar 10 '14 at 20:12
  • For example: `ゴールデンタイムラバー/スキマスイッチ` is 57 bytes or 19 characters. – Luka Mar 10 '14 at 20:12
  • Wherever you're calling the two-arg version of `utf8_count`, replace the second parameter with `strlen(in)`. I think you'll find it behaves just like your single-arg version. Then, find out why whatever you're passing in is not equal to `strlen(in)`. – TypeIA Mar 10 '14 at 20:14
  • Why does strlen return 19 and not 57 for the example string, and why doesn't the function work with 59 as input, please? – Luka Mar 10 '14 at 20:15
  • @Luka - there are no ready-made functions that return this count? What compiler and OS are you using? – PaulMcKenzie Mar 10 '14 at 20:17
  • `strlen` on your example string should return 57 (or 59, whatever the length in bytes is), *not* 19. The name is a bit of a misnomer. It is not aware of UTF8 or any other encoding; it simply counts the number of non-zero `char` values (= bytes, generally) before it encounters a zero value. – TypeIA Mar 10 '14 at 20:17
  • @PaulMcKenzie: VS2012 under windows – Luka Mar 10 '14 at 20:18
  • @Luka - strlen knows nothing about utf-8. All it knows is that if it encounters a 0 byte, then stop counting. That's how it works. Again, aren't there already system or OS specific functions that do this job already? – PaulMcKenzie Mar 10 '14 at 20:18
  • I am confused. Are you basically saying VS12 has a different version of strlen that counts characters rather than bytes? – Luka Mar 10 '14 at 20:20
  • @Luka, then use the Windows specific functions to determine character length. _tcslen or something like that, but do your research. There is no need to write character length functions -- the world now knows there are different flavors of 8 and 16-bit character sets out there, and the OSes have functions that return proper information. – PaulMcKenzie Mar 10 '14 at 20:20
  • @Luka - strlen() knows nothing about character sets. All it sees is a series of bytes and stops at the first one that equals 0. There is nothing more to read into it than that. – PaulMcKenzie Mar 10 '14 at 20:21
  • I want portability. I want the code to compile with g++ as well. Anyway, let's ignore the fact that strlen returns the wrong number of characters: why does `utf8_count(in, 57)` return 57 and not 19? – Luka Mar 10 '14 at 20:22
  • See here: http://stackoverflow.com/questions/4063146/getting-the-actual-length-of-a-utf-8-encoded-stdstring – PaulMcKenzie Mar 10 '14 at 20:28
  • Try writing each individual byte value to stdout on each loop iteration and see what you get. Also, SSCCE please. – Alan Stokes Mar 10 '14 at 20:33
  • @Luka - You could also print out the value of i just before the return statement, so that you can see how many iterations were done in the while loop. – PaulMcKenzie Mar 10 '14 at 20:38
  • So, the only possible scenario, other than bugs, is the one we have: there are NULL bytes in some positions and strlen stops there. Now I am more confused, why so many NULL bytes? – Luka Mar 10 '14 at 20:46
  • http://stackoverflow.com/a/5117481/412080 – Maxim Egorushkin Mar 10 '14 at 22:12

2 Answers

The issue you're seeing is with your example strings.

Look at ゴールデンタイムラバー/スキマスイッチ. Your example bytes show 18 repetitions of `00111111` (0x3F, the ASCII code for `?`) before a null byte. By my count, the first function should return 18 and the second should return some larger number. Are you sure you're passing in the correct string?

I don't think the bytes you're showing us in the image correspond to the text ゴールデンタイムラバー/スキマスイッチ, if only because I don't see the same character repeated several times at the start of this string.
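
One way to check which bytes actually reach the function is to dump them in hex before counting. The sketch below is a hypothetical helper (`hex_dump` is not from the question), assuming the string is available as a `char` buffer with a known length. A run of `3f` bytes would match the `00111111` pattern in the image, i.e. literal `?` characters; how the string ended up that way is not established in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: renders each byte of a buffer as two lowercase hex
// digits, separated by spaces, so the raw encoding can be inspected.
std::string hex_dump(const char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < size; ++i)
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char byte = static_cast<unsigned char>(data[i]);
        if (i != 0)
            out += ' ';
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}
```

For genuine UTF-8, `hex_dump("\xE3\x82\xB4", 3)` gives `e3 82 b4` (the three bytes of ゴ), whereas a buffer of three `?` characters gives `3f 3f 3f`. Since `0x3f & 0xc0` is `0x00`, every `?` passes the `!= 0x80` test, so for such a buffer the count equals the byte length.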

mamidon

Works perfectly fine here: http://ideone.com/oepQg1

I tested it on Windows 8 with both g++ 4.8.1 (in CodeBlocks) and MSVC 2013, and also on Linux. It works: both functions print 19.

So whatever you're feeding it is not the same string that you have in the OP.

// UTF8Test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <cstring>
#include <clocale>

int strlen_u8(const char* str)
{
    int I = 0, J = 0;

    while (str[I])
    {
        if ((str[I] & 0xC0) != 0x80)
        {
            ++J;
        }
        ++I;
    }
    return J;
}

int strlen_s_u8(const char* str, unsigned int size)
{
    unsigned int I = 0, J = 0;
    while (I < size)
    {
        if ((str[I] & 0xC0) != 0x80)
        {
            ++J;
        }
        ++I;
    }
    return J;
}


#if defined _MSC_VER || defined _WIN32 || defined _WIN64
int _tmain(int argc, _TCHAR* argv[])
#else
int main(int argc, char* argv[])
#endif
{
    #ifdef _MSC_VER
    const char* str = "ゴールデンタイムラバー/スキマスイッチ";
    #else
    const char* str = u8"ゴールデンタイムラバー/スキマスイッチ";
    std::setlocale(LC_ALL, "ja_JP.UTF-8");
    #endif

    std::cout << strlen_u8(str) << "\n";
    std::cout << strlen_s_u8(str, strlen(str)) << "\n"; //can use 57 instead of strlen.
    std::cin.get();
}
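
A variant that does not depend on how the compiler decodes the source file is to spell the test string as explicit UTF-8 bytes. This is only a sketch: `kSample`, `count_code_points` and `count_code_points_n` are hypothetical names, the sample is a short three-character string rather than the one from the question, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "ゴー/" written as explicit UTF-8 bytes: E3 82 B4, E3 83 BC, 2F.
// That is 3 characters in 7 bytes, regardless of the source file encoding.
const char kSample[] = "\xE3\x82\xB4\xE3\x83\xBC/";

// Counts the bytes that are not UTF-8 continuation bytes (10xxxxxx),
// looking at exactly `size` bytes.
std::size_t count_code_points_n(const char* in, std::size_t size)
{
    assert(in != nullptr || size == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        if ((static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Same count for a null-terminated string: the byte length comes from strlen.
std::size_t count_code_points(const char* in)
{
    assert(in != nullptr);
    return count_code_points_n(in, std::strlen(in));
}
```

With this sample, `std::strlen(kSample)` is 7 while both counting functions return 3. Feeding the sized version seven `?` bytes returns 7, because ASCII bytes are never continuation bytes.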
Brandon