2

I want to read a Unicode (UTF-8) file character by character, but I don't know how to read a file one character at a time.

Can anyone tell me how to do that?

informatik01
John
  • Do you want to read individual Unicode characters or UTF-8 bytes? – Dietmar Kühl Jan 07 '12 at 02:24
  • Read the file, then convert UTF-8 to UTF-32. You can either use `iconv()`, libicu, or C++11. – Kerrek SB Jan 07 '12 at 02:27
  • 1
    @Kerrek SB does C++11 include this? What class or function should we look for? –  Jan 07 '12 at 02:33
  • @WTP: It should be in ``, and it's actually coming in from the C99 support. There's definitely UTF16 <-> UTF32 support; I'm not 100% sure right now if there's also UTF8 support. – Kerrek SB Jan 07 '12 at 02:37
  • C++11 does have UTF-8 support. `codecvt` converts between UTF-8 and UTF-32. You can use it with `wstring_convert` like so: `wstring_convert,char32_t> convert; u32string s = convert.from_bytes("foo");` – bames53 Jan 07 '12 at 08:35
  • One thing to keep in mind is that Unicode codepoints are not necessarily characters. If you iterate through a string treating codepoints as characters you may fail to handle characters that are composed of multiple codepoints correctly. E.g. if you try to reverse a string of characters by reversing the codepoints you will corrupt combining codepoint sequences. – bames53 Jan 07 '12 at 08:41
  • Oh, and MSVC 2010 doesn't yet support the `char16_t` or `char32_t` specializations of `std::codecvt`. It does support `codecvt_utf8` though. Here's an answer with more details: http://stackoverflow.com/a/7235204/365496 – bames53 Jan 07 '12 at 08:47
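The distinction raised in the comments between bytes, code points, and user-perceived characters can be seen with a small sketch. The helper name `count_code_points` is made up for illustration, and the code assumes well-formed UTF-8: it simply counts every byte that is not a continuation byte (10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a well-formed UTF-8 string
// by counting the bytes that are NOT continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301, a combining acute accent), `size()` reports 3 bytes and `count_code_points` reports 2 code points, yet a reader sees a single accented letter. Grouping code points into user-perceived characters needs a library such as ICU.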

4 Answers

4

First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description

Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, check it against that table:

(Row 1) If the most significant bit is 0 ((byte & 0x80) == 0), that byte is your character. Note the inner parentheses: in C/C++, == binds more tightly than &.

(Row 2) If the three most significant bits are 110 ((byte & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the 6 least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily with the shift and bitwise operators of C/C++:

UnicodeByte1 =   (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);

And so on...

Sounds a bit complicated, but it's not difficult if you know how to modify the bits to put them in proper place to decode a UTF-8 string.
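The table-driven decoding described above can be sketched for all four rows as follows. This is a minimal illustration, not a validating decoder: the names `read_code_point` and `decode_all` are made up here, and the code assumes it is acceptable to stop at the first malformed sequence. It does not reject overlong encodings or surrogate code points.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one UTF-8 encoded code point from `in`.
// Returns false at end of input or on a malformed sequence.
bool read_code_point(std::istream& in, char32_t& out) {
    char c;
    if (!in.get(c)) return false;
    const unsigned char lead = static_cast<unsigned char>(c);

    int extra = 0;  // number of continuation bytes that must follow
    if ((lead & 0x80) == 0x00)      { out = lead;        extra = 0; }  // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { out = lead & 0x1F; extra = 1; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { out = lead & 0x0F; extra = 2; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { out = lead & 0x07; extra = 3; }  // 11110xxx
    else return false;  // stray continuation byte or invalid lead byte

    for (int i = 0; i < extra; ++i) {
        if (!in.get(c)) return false;  // input ended mid-sequence
        const unsigned char next = static_cast<unsigned char>(c);
        if ((next & 0xC0) != 0x80) return false;  // not of the form 10xxxxxx
        out = (out << 6) | (next & 0x3F);  // append the 6 payload bits
    }
    return true;
}

// Hypothetical convenience wrapper: decodes a whole byte string.
std::u32string decode_all(const std::string& bytes) {
    std::istringstream in(bytes);
    std::u32string result;
    char32_t cp;
    while (read_code_point(in, cp)) result.push_back(cp);
    return result;
}
```

The same `read_code_point` function should work on a `std::ifstream` opened in binary mode, since it only relies on `std::istream::get`.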

Hossein
  • 1
    To take it a step farther, the first byte in a UTF-8 byte sequence tells you how many additional bytes are in the sequence. – Remy Lebeau Jan 07 '12 at 23:00
3

UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:

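#include <iostream>
#include <string>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator

std::ifstream fs("my_file.txt");
// The extra parentheses around the first argument avoid the "most vexing parse".
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());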

Each char in the resulting string holds one UTF-8 byte, so a non-ASCII character spans several chars. You could loop through the bytes like so:

for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
    char nextChar = *i;
    // do stuff here.
}
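To step through `content` one character at a time rather than one byte at a time, the lead byte of each sequence can be used to decide how many bytes belong together. The sketch below is one way to do that; `sequence_length` and `split_code_points` are illustrative names, and passing a malformed byte through as a one-byte piece is an assumption about the desired error handling, not a requirement of UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the sequence that starts with
// `lead`, or 0 if `lead` cannot start a UTF-8 sequence.
std::size_t sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Splits a UTF-8 string into one std::string per code point.
std::vector<std::string> split_code_points(const std::string& content) {
    std::vector<std::string> pieces;
    std::size_t i = 0;
    while (i < content.size()) {
        std::size_t len = sequence_length(static_cast<unsigned char>(content[i]));
        if (len == 0 || i + len > content.size()) {
            len = 1;  // malformed or truncated: pass the single byte through
        }
        pieces.push_back(content.substr(i, len));
        i += len;
    }
    return pieces;
}
```

Each element of the returned vector is still UTF-8, so it can be written back to a file or a UTF-8 terminal unchanged.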

Alternatively, you could open the file in binary mode, and then move through each byte that way:

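std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
    char nextChar;
    // get() reads every byte, including whitespace, which operator>> would skip.
    // Testing the read itself avoids processing a stale byte at end of file.
    while (fs.get(nextChar)) {
        // do stuff here.
    }
}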

If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.

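#include <QFile>
#include <QTextStream>

QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8");  // Qt 5; in Qt 6 use in.setEncoding(QStringConverter::Utf8)
    QString contents = in.readAll();
    // do stuff with contents here.
}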
Liam M
  • Your solution does not output letters, but bytes. This works only for the ASCII part of the utf-8 character set. – Jindra Helcl Jan 10 '15 at 13:23
  • @JindraHelcl My solution doesn't output anything: it reads a file and makes the data in that file available for further processing. The asker never specified whether he wanted to read the bytes in the file (which my solution answers) or read the characters in file (which I've shown how to do, using Qt). Keep in mind, this answer is 3 years old. – Liam M Jan 27 '15 at 04:09
1

In theory, stdlib.h has a function, mblen, which should return the length of a multibyte character. In my case, though, it returns -1 for the first byte of a multibyte character and keeps returning -1 after that. So I wrote the following:

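// The original signature was not shown; the name utf8_length is hypothetical.
int utf8_length(const char* i_ch)
{
    if (i_ch == nullptr) return -1;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    int l = 0;
    // Count the leading 1 bits of the first byte.
    for (int mask = 0x80; ch & mask; mask >>= 1) {
        l++;
    }
    if (l == 0) return 1;            // 0xxxxxxx: single-byte (ASCII) character
    if (l == 1 || l > 4) return -1;  // continuation byte or invalid lead byte
    return l;                        // 2, 3 or 4 bytes in the sequence
}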

That took less time than researching how mblen is supposed to be used.

zessx
Andrey
-2

Try this: read the file into a string, then loop through the text based on its length.

Pseudocode:

String s = file.toString();
int len = s.length();
for(int i=0; i < len; i++)
{
    String the_character = s[i];

    // TODO : Do your thing :o)
}
pxp