
Section#1

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    static const unsigned char text[] = "000ßh123456789";
    int32_t current=1;
    int32_t text_len = (int32_t)strlen((const char *)text); /* length in bytes, not characters */
    /////////////////////////////////
    printf("Result : %s\n",text);
    /////////////////////////////////
    printf("Length : %d\n",text_len);
    /////////////////////////////////
    printf("Index0 : %c\n",text[0]);
    printf("Index1 : %c\n",text[1]);
    printf("Index2 : %c\n",text[2]);
    printf("Index3 : %c\n",text[3]);//==> why show this `�`?
    printf("Index4 : %c\n",text[4]);//==> why show this `�`?
    printf("Index5 : %c\n",text[5]);
    /////////////////////////////////
    return 0;
}

Why do text[3] and text[4] show `�`?

How can I also support UTF-8 characters when indexing?


Section#2

I want to write a function like mb_substr in PHP.

(verybigstring or string) mb_substr ( (verybigstring or string) input , (verybigint or int) start [, (verybigint or int) $length = NULL ] )

Some examples:

  • mb_substr("hello world",0);

    ==>hello world

  • mb_substr("hello world",1);

    ==>ello world

  • mb_substr_two("hello world",1,3);

    ==>el

  • mb_substr("hello world",-3);

    ==>rld

  • mb_substr_two("hello world",-3,2);

    ==>rldhe

My question is Section#1.

Can anyone help me, please?

GoWorkCode
  • #1 That doesn't look like an ASCII char. #2 You need to try something yourself. SO is not a code writing service. – John3136 Apr 18 '17 at 22:46
  • In #1, how can I index a UTF-8 string by character? – GoWorkCode Apr 18 '17 at 23:03
  • 1) One question per question! 2) Unclear. 3) We are not a coding/tutoring/consulting service. – too honest for this site Apr 18 '17 at 23:29
  • Since your input is encoded using UTF-8 (I presume), your code will need to be UTF-8-aware. Alternatively, you can start by "decoding" the UTF-8 bytes into an array of Unicode Code Points (stored in `int32_t` or `uint32_t`), and work with the array. (This requires more memory, but simplifies the problem.) – ikegami Apr 18 '17 at 23:51
  • My problem and my question is #1. – GoWorkCode Apr 19 '17 at 11:33

1 Answer


The Unicode character set currently includes more than 128,000 characters (which I shall henceforth call Code Points to avoid confusion) with space reserved for far, far more. As such, a char, which is only 8 bits in size on modern general-computing machines, can't hold an arbitrary Code Point.

UTF-8 is a way of encoding these Code Points into bytes. The following are the bytes you placed in text[] (assuming UTF-8 was used to encode the Code Points) and what they represent:

i:             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
text[i]:    0x30 30 30 C3 9F 68 31 32 33 34 35 36 37 38 39 00
              -- -- -- ----- -- -- -- -- -- -- -- -- -- -- --
Code Point: U+30 30 30    DF 68 31 32 33 34 35 36 37 38 39  0
Graph:         0  0  0     ß  h  1  2  3  4  5  6  7  8  9

As you can see, UTF-8 is a variable-width encoding. Each Code Point encodes to between one and four bytes. This means you can't translate indexes-into-text into indexes-into-array-of-bytes without scanning the array.

A Code Point encoded using UTF-8 starts with one of the following bytes:

0b0xxxxxxx    Represents an entire Code Point
0b110xxxxx    The start of a 2-byte sequence
0b1110xxxx    The start of a 3-byte sequence
0b11110xxx    The start of a 4-byte sequence

The only other form of bytes you will encounter in UTF-8 is

0b10xxxxxx    A continuation byte (the 2nd, 3rd or 4th byte of a sequence)

A simple way to find the nth Code Point in a string (if you assume the input is valid UTF-8) is to search for the nth char for which (ch & 0xC0) != 0x80 is true, that is, the nth byte that is not a continuation byte. You can use the same approach to count the number of Code Points in a string.

ikegami
  • Good. I want to make Section 2, but my question is Section 1. Please help me fix #1. – GoWorkCode Apr 19 '17 at 11:32
  • I already did. I showed you how to find the start of a Code Point. That means you also know how to find the start of the next Code Point. With that, you have the starting indexes of the Code Point and the length of the Code Point. So all you have to do is print those bytes. – ikegami Apr 19 '17 at 14:30
  • Thank you. Should I use the `wchar_t` type for the string? – GoWorkCode Apr 19 '17 at 14:31
  • If you want, you can deal with an array of Code Points instead of dealing with UTF-8 by decoding the UTF-8 bytes (stored in `char`) into Code Points (`wchar_t`) and vice-versa on output. – ikegami Apr 19 '17 at 14:33
  • Speed is important for me. – GoWorkCode Apr 19 '17 at 14:38