Recognizing Lithuanian letters from fstream in C++

Question

My IT teacher gave me a task: count how many letters, digits, whitespace characters and other symbols there are in a given text. The problem is that the text is written with Lithuanian letters (Š, š, Ę, ę, Ų, ų, etc.) and I don't know how to recognize them in C++. To count each type of symbol, I read the text line by line with getline() from an fstream into a string and then iterate through the string, comparing each character with character literals. For example, (c >= 'A' && c <= 'Z') means that it's an uppercase letter, but this doesn't work with Lithuanian characters. I guess the text file is saved in Unicode format. Please help me recognize Lithuanian letters in the text.

Have you thought about printing the 'unrecognized' values so you can see how they are encoded? If you can see that, you also know how to recognize them. — Jongware, Oct 26 '14 at 11:35
Convert your input into UTF-32, normalize, and then search for substrings. Some of this requires specialist text processing libraries (e.g. [ogonek](https://github.com/rmartinho/ogonek), or ICU). Standard C++ can help you with [``](http://stackoverflow.com/q/7562609/596781) and [system encodings](http://stackoverflow.com/q/6300804/596781) and [codecvt](http://en.cppreference.com/w/cpp/locale/codecvt), but you still need to do the actual text work yourself. — Kerrek SB, Oct 26 '14 at 11:37
@Jongware Of course I have tried this, but it didn't work. I tried the `char` type and it gave me negative values, and I also tried the `wchar_t` type and it gave me values like `65488`, but comparing these values with characters from the string didn't work... — Salivan, Oct 26 '14 at 11:37
@KerrekSB This seems relatively hard as an option for such a short task. Isn't there a simpler way? — Salivan, Oct 26 '14 at 11:39
Inspect your text file with a hex editor. This will determine if it is "Unicode" (by which, I presume, you mean 16-bit or 32-bit wide character codes) or, more likely, UTF-8. For the latter you will see ASCII characters 'as usual' but your Lithuanian characters will consist of two or more bytes: http://www.fileformat.info/info/unicode/char/0119/index.htm (the UTF-8 entry). — Jongware, Oct 26 '14 at 11:42
@Jongware Well, I could just select a file format in my Visual Studio IDE as it doesn't have to be a specific type. So you think I should convert it to UTF-8? — Salivan, Oct 26 '14 at 11:44
@Jongware How do I read the characters if they consist of different numbers of bytes (1, 2, 3)? I mean, I can't use just `char` or `char16_t` or `char32_t`. Must I use them together somehow? How do I determine how many bytes the next character in the string consists of? — Salivan, Oct 26 '14 at 11:52
Which format does your input text come in? Is it a documented encoding, or is it the opaque "system encoding" that you get, say, when you read keyboard entries from the standard input? — Kerrek SB, Oct 26 '14 at 12:25
@KerrekSB I don't know exactly. The text file is encoded in Unicode and I use `fstream` to manage it. — Salivan, Oct 26 '14 at 12:28
@Salivan: Unicode is only an abstract mapping of numbers to meaning. It is not a serialization format. What is the actual data representation on disk? UTF-8, UTF-16, Magic-Homebrew 2.0? — Kerrek SB, Oct 26 '14 at 12:30
@KerrekSB Sorry, but I don't know exactly what you are talking about. I am using the Visual Studio IDE on a Windows 8 machine. — Salivan, Oct 26 '14 at 12:32

score 0 · Answer 1 · answered Oct 26 '14 at 12:01

I think you probably have to open your file in binary mode, e.g. with `(fileName, ios::in | ios::binary)`, and read the file byte by byte.
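A minimal sketch of the binary-mode read, so the raw bytes can be inspected and the on-disk encoding becomes visible. The helper names `read_bytes` and `hex_dump` are made up for this illustration and are not part of any library.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: read the whole file as raw bytes.
// Binary mode keeps the stream from translating line endings.
// Returns an empty string if the file cannot be opened.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: render each byte as two lowercase hex digits,
// separated by single spaces.
std::string hex_dump(const std::string& bytes) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) out << ' ';
        // Go through unsigned char so bytes >= 0x80 do not print as negative values.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
    }
    return out.str();
}
```

If the file is UTF-8, ASCII letters show up as one byte each, while a letter such as Š shows up as the two bytes `c5 a0`.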

Phyp

But, as Jongware commented under my question, characters might consist of two or more bytes. So what then? I can't rely on reading the file byte by byte... – Salivan Oct 26 '14 at 12:03
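Assuming the file is UTF-8, reading byte by byte still works, because the first byte of every character encodes how many bytes the character occupies. A minimal sketch follows; the function names are hypothetical, and the code does not fully validate the input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`. Returns 0 if `lead` cannot start a sequence
// (for example, a continuation byte of the form 10xxxxxx).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: count characters (code points) in a UTF-8 string by
// hopping from lead byte to lead byte. Bytes that cannot start a sequence
// are skipped one at a time and counted as one character each.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(utf8[i]));
        i += (len == 0) ? 1 : len;
        ++count;
    }
    return count;
}
```

All the Lithuanian letters with diacritics fall in the two-byte range, so in practice only the first two branches are hit for such a text.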

score 0 · Answer 2 · edited May 23 '17 at 10:33

As I understand it, your text is stored in UTF-8 encoding. If it were UTF-16 or UTF-32, your getline() function would almost always return one or zero symbols, and I think you would have noticed this. UTF-8 is described here: https://ru.wikipedia.org/wiki/UTF-8. You can use the standard library to convert a UTF-8 string to a wstring: UTF8 to/from wide char conversion in STL. Then you can use map<wchar_t, int> to count the different symbols.
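A rough sketch of the decode-then-classify idea, assuming UTF-8 input. It swaps the standard-library conversion for a small hand-written decoder producing `char32_t` code points, and uses four counters instead of a map. All names (`Counts`, `decode_utf8`, `is_letter`, `count_symbols`) are invented for this example, and the letter test only knows ASCII plus the Lithuanian letters with diacritics.

```cpp
#include <cstddef>
#include <string>

struct Counts { int letters = 0; int digits = 0; int spaces = 0; int others = 0; };

// Decode UTF-8 into code points. Malformed input becomes U+FFFD.
// Not a fully validating decoder (overlong forms are not rejected).
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i++]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}

// ASCII letters plus Ą ą Č č Ę ę Ė ė Į į Š š Ų ų Ū ū Ž ž.
bool is_letter(char32_t c) {
    static const std::u32string lithuanian =
        U"\u0104\u0105\u010C\u010D\u0118\u0119\u0116\u0117\u012E\u012F"
        U"\u0160\u0161\u0172\u0173\u016A\u016B\u017D\u017E";
    if ((c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')) return true;
    return lithuanian.find(c) != std::u32string::npos;
}

Counts count_symbols(const std::string& utf8) {
    Counts result;
    for (char32_t c : decode_utf8(utf8)) {
        if (c == 0xFEFF) continue;  // skip a byte order mark, if present
        if (is_letter(c)) ++result.letters;
        else if (c >= U'0' && c <= U'9') ++result.digits;
        else if (c == U' ' || c == U'\t' || c == U'\n' || c == U'\r') ++result.spaces;
        else ++result.others;
    }
    return result;
}
```

Each line returned by getline() can be passed to `count_symbols` and the four counters summed; note that getline() strips the newline itself, so line breaks would have to be counted separately if they should count as whitespace.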

score 0 · Answer 3 · answered Oct 26 '14 at 12:15

I had to handle UTF-8 and ended up using utf8-cpp.

For all practical UTF-8-related problems, I recommend reading "UTF-8 Everywhere".

Germán Diago
