c++ How to manipulate and work with UTF-8 characters

Question

I am trying to read a UTF-8 encoded .txt file and need to do validations on it.

I am working on Windows 10, but I need the solution to work the same way on Linux. I use Dev-C++ 6.3 with the TDM-GCC 9.2.0 64-bit compiler, and I am compiling as GNU C++11.

At the moment I am reading the following .txt file:

Inicio
D1
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

This is my code:

#include <iostream>
#include <locale.h>
#include <locale>
#include<fstream>
#include<string>
#include <windows.h>
#define CP_UTF8 65001 
#define CP_UTF32 12000 

#include <codecvt>

using std::cout;

std::wstring utf8_to_ws(std::string const&);

int main(){
    
    std::ifstream file;
    std::string text;

    if (!SetConsoleOutputCP(CP_UTF8)) {
        std::cerr << "error: UTF-8 codigo.\n";
        return 1;
    } 

    file.open("entryDisciplineESP.txt");
    
    int line = 0;
    
    if (file.fail()){
        
        cout<<"Error. \n";
        
        exit(1);
        
    }
    
    while(std::getline(file,text)){ 
        
        if(linea == 2){

            std::cout<<text[5]<<"\n";
            auto a = utf8_to_ws(text);
            std::wcout<<a<<"\n";
            
        }
        
        std::cout<<text<<"\n";
        
        line++;
        
    }
    
    cout<<"\n";
    
    system("Pause");
    return 0;
}



std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

And I am receiving the following by console:

Inicio
D1

Biatln
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

If I print the whole line on the screen, I receive the character "ó", but not separately. I need to interact with that character to do validations: I need to check that there are no numbers or special characters on that line ("!", "?", ":", etc.). I also need to save that name in a string, interact with it, and display results on the console.

Thanks in advance.

Does this answer your question? [c++ how can i work and validate a utf8 character](https://stackoverflow.com/questions/71804573/c-how-can-i-work-and-validate-a-utf8-character) — Retired Ninja, Apr 10 '22 at 23:48
Ok, you described what you need, but what is your specific C++ question, for Stackoverflow? — Sam Varshavchik, Apr 10 '22 at 23:48
@RetiredNinja No, I've updated the code and I still don't get what I want. — ramej, Apr 10 '22 at 23:59
@SamVarshavchik I want to know how to manipulate the UTF-8 character in order to validate it. — ramej, Apr 11 '22 at 00:00
There is no question in your post. `I need to check that there are no numbers or special characters on that line` so do it, so check it, so interact with it, what is stopping you? `if(linea == 2){` has a compile error. `but not separately` I do not understand that sentence, you do not receive the character separately? They are all in the file, you receive them together. — KamilCuk, Apr 11 '22 at 00:04
@KamilCuk As I said in the publication, I need to go through the string character by character to validate it and check that it does not have any illegal characters, what I am trying to show is that I do not get this, I simply do not get the character separately, therefore I cannot do the validation. — ramej, Apr 11 '22 at 00:16
There's a Wikipedia page that explains how to encode and decode UTF-8. Are you familiar with how UTF-8 characters are encoded? That's the first thing to learn. Unfortunately, UTF-8 support in the C++ library is rather slim, and most applications use some third-party Unicode library to deal with it. I happen to have written one that implements conversion between UTF-8 and UTF-32, and implements the UTF-32 versions of alnum(), isdigit(), etc. Unfortunately library recommendations are off-topic for Stackoverflow, so I won't. But Google is down the hall, last door on the left... — Sam Varshavchik, Apr 11 '22 at 00:34
When you've asked essentially the same question several times and aren't getting anywhere you might consider that you're asking the wrong question or trying to solve the problem the wrong way. You might consider using a library like https://github.com/nemtrif/utfcpp to help you work with utf-8 code points. — Retired Ninja, Apr 11 '22 at 00:38
If you are trying to get it to work on windows: https://learn.microsoft.com/en-us/archive/msdn-magazine/2016/september/c-unicode-encoding-conversions-with-stl-strings-and-win32-apis — Jerry Jeremiah, Apr 11 '22 at 01:47

score 0 · Answer 1 · answered Apr 11 '22 at 04:27

There is a whole can of worms in dealing with Unicode on Windows, alas, but your main problem (for this example) is that you are treating a Unicode “character” (a code point) as if it were a single-byte entity. (And you are doing it before you have converted to a wide string!)

When dealing with UTF-8, you no longer have that luxury. Once you admit UTF-8, everything is a string. Even single code points.

In other words, every “character” must now be treated as a 1–4 byte string.

Thus, to print ó to any stream (not just the terminal), you must print the proper multi-byte UTF-8 code sequence "ó". Notice it is a string. (It is not a one-byte char.)
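As a minimal sketch of that idea (not a complete validator), the helpers below split a UTF-8 std::string into one small string per code point, so that ó comes back as the two-byte string "\xC3\xB3". The names `utf8_sequence_length`, `split_code_points` and `is_valid_name` are made up for illustration, and the validation rule (reject ASCII digits and ASCII punctuation, accept everything else) is only an assumption about what the question needs. The structural check is simplified: it verifies lead and continuation bytes, but it does not reject every ill-formed sequence (for example overlong three-byte forms or encoded surrogates).

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with lead byte b,
// or 0 if b cannot start a sequence (continuation byte or invalid lead).
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;                // 0xxxxxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;  // 110xxxxx (0xC0/0xC1 are never valid)
    if (b >= 0xE0 && b <= 0xEF) return 3;  // 1110xxxx
    if (b >= 0xF0 && b <= 0xF4) return 4;  // 11110xxx (up to U+10FFFF)
    return 0;
}

// Split a UTF-8 string into one std::string per code point.
// Throws std::runtime_error on a bad lead byte, a truncated sequence,
// or a missing continuation byte.
std::vector<std::string> split_code_points(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (n == 0 || i + n > s.size())
            throw std::runtime_error("invalid UTF-8");
        for (std::size_t k = 1; k < n; ++k) {
            // Every continuation byte must look like 10xxxxxx.
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8");
        }
        out.push_back(s.substr(i, n));
        i += n;
    }
    return out;
}

// Assumed rule, for illustration only: a name may not contain ASCII digits
// or ASCII punctuation. Multi-byte code points such as "ó" are accepted
// without further classification.
bool is_valid_name(const std::string& line) {
    for (const std::string& cp : split_code_points(line)) {
        if (cp.size() == 1) {
            const unsigned char c = static_cast<unsigned char>(cp[0]);
            if (std::isdigit(c) || std::ispunct(c)) return false;
        }
    }
    return true;
}
```

With the sample file, `split_code_points(text)[5]` on the line `Biatlón` is the complete two-byte string for ó and can be sent to `std::cout` as is, whereas `text[5]` is only the first of its two bytes, which is why the question's program prints nothing useful for it. Deciding whether a non-ASCII code point is a letter needs Unicode character data, which this sketch deliberately leaves out.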

Hint 1: If you want to be safe on all versions of Windows using whatever output console/terminal your program is attached to, use the Windows wide character Console API routines for all output. You can easily set the rdbuf for the standard streams to use a custom UTF-8 → UTF-16 convert-and-print buffer for those that are attached to the console/terminal, and leave them alone otherwise.
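One way Hint 1 could look in practice is sketched below, under several assumptions: the class name `Utf8ConsoleBuf` is invented for this example, it is hard-wired to the standard output handle (so it suits `std::cout` only), and it buffers everything until the stream is flushed. On Windows, when output really is a console, it converts with `MultiByteToWideChar` and prints with `WriteConsoleW`; on other platforms, or when output is redirected, it forwards the bytes unchanged to the original buffer.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <streambuf>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Collects UTF-8 bytes and, on flush, hands them to the Windows console as
// UTF-16. Everywhere else the bytes pass through to the original buffer.
class Utf8ConsoleBuf : public std::streambuf {
public:
    explicit Utf8ConsoleBuf(std::streambuf* original) : original_(original) {}

protected:
    // No put area is set, so every character written arrives here.
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }

    // Called on std::flush / std::endl: emit whatever has been collected.
    int sync() override {
        emit(pending_);
        pending_.clear();
        return original_->pubsync();
    }

private:
    void emit(const std::string& utf8) {
        if (utf8.empty()) return;
#ifdef _WIN32
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode = 0;
        if (GetConsoleMode(h, &mode)) {  // succeeds only for a real console
            const int len = static_cast<int>(utf8.size());
            // First call asks for the required number of UTF-16 code units.
            const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len,
                                              nullptr, 0);
            std::wstring wide(static_cast<std::size_t>(n), L'\0');
            MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, &wide[0], n);
            DWORD written = 0;
            WriteConsoleW(h, wide.data(), static_cast<DWORD>(n), &written,
                          nullptr);
            return;
        }
#endif
        // Not a Windows console: pass the UTF-8 bytes through untouched.
        original_->sputn(utf8.data(),
                         static_cast<std::streamsize>(utf8.size()));
    }

    std::string pending_;
    std::streambuf* original_;
};
```

To try it, construct `Utf8ConsoleBuf buf(std::cout.rdbuf());`, keep the old pointer returned by `std::cout.rdbuf(&buf)`, write and flush as usual, and restore the old pointer before `buf` goes out of scope. Because conversion happens per flush, a flush that lands in the middle of a multi-byte sequence would be converted incorrectly; a fuller version would hold back an incomplete trailing sequence.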

Hint 2: Most modern systems ship with some version of the ICU (International Components for Unicode) library: Linux, Windows, Android, iOS, etc. Use it to deal with UTF-encoding conversions.
