c++ read file with accents

Question

Good day. I am working on a small project that needs to read .txt files. Some are in English and others in Spanish, so some of the data contains accented characters, and I must show them on the console with the accents intact.

I have no problem displaying accents on console with setlocale(LC_CTYPE, "C");

My problem is reading the .txt file: the accented characters are not recognized and strange characters are displayed instead.

my practice code is:

#include <iostream>
#include <locale.h>
#include<fstream>
#include<string>

using namespace std;

int main(){
    
    setlocale (LC_CTYPE, "C");

    ifstream file;
    string text;
    
    file.open("entryDisciplineESP.txt",ios::in);
    
    if (file.fail()){
        
        cout<<"The file could not be opened."<<endl;
        
        exit(1); 
        
    }
    
    while(!file.eof()){ 

        getline(file,text);
        
        cout<<text<<endl;
        
    }
    
    cout<<endl;
    
    system("Pause");
    return 0;
}

The .txt file in question contains:

Inicio
D1
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

Obviously I'm having problems with 'ó', but I have other .txt files with other accented characters, so I need a solution that works for all of them.

While researching, I read about wstring and wifstream and tried to use them, but without success.

I'm trying to achieve this on Windows, but I also need the solution to work on Linux. At the moment I'm using Dev-C++ 5.11.

Thank you very much in advance for your time and help.

`while (getline(file,text)) { ... }` See [Why !.eof() inside a loop condition is always wrong.](https://stackoverflow.com/q/5605125/9254539) — David C. Rankin, Apr 07 '22 at 03:00

David C. Rankin · Accepted Answer · 2022-04-08T03:40:08.837

Your error is how you control your read-loop. See: Why !.eof() inside a loop condition is always wrong. Instead, control your read-loop with the stream-state returned by your read-function, e.g.

    while (getline(file,text)) {
        
        std::cout << text << '\n';
        
    }
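
To make the difference between the two loop styles concrete, here is a minimal sketch that runs both over the same in-memory input. The function names `read_with_eof_check` and `read_with_getline_check` are made up for illustration, and the input is assumed to end with a newline, as most text files do.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Anti-pattern: eof() only becomes true after a read has already failed,
// so the loop body runs one extra time and records an extra entry.
std::vector<std::string> read_with_eof_check(std::istream& in) {
    std::vector<std::string> lines;
    std::string text;
    while (!in.eof()) {
        std::getline(in, text);   // result of the read is never checked
        lines.push_back(text);
    }
    return lines;
}

// Preferred: the stream returned by getline() converts to false as soon
// as a read fails, so only successfully read lines are recorded.
std::vector<std::string> read_with_getline_check(std::istream& in) {
    std::vector<std::string> lines;
    std::string text;
    while (std::getline(in, text)) {
        lines.push_back(text);
    }
    return lines;
}
```

For the input "Inicio\nD1\n" the first function returns three entries and the second returns two, which is why the original loop prints one line more than the file contains.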

The character in question, 'ó', is stored in a UTF-8 file as the two bytes 0xC3 0xB3. std::string holds those bytes unchanged and std::cout writes them out unchanged, so no wide-character types are needed. Your full example, also addressing Why is “using namespace std;” considered bad practice?, would be

#include <clocale>      /* setlocale */
#include <cstdlib>      /* exit, system */
#include <iostream>
#include <fstream>
#include <string>

int main() {
    
    setlocale (LC_CTYPE, "C");

    std::ifstream file;
    std::string text;
    
    file.open ("entryDisciplineESP.txt");
    
    if (file.fail()){
        
        std::cerr << "The file could not be opened.\n";
        
        exit(1); 
    }
    
    while (getline(file,text)) {
        
        std::cout << text << '\n';
    }
    
    std::cout.put('\n');
    
#ifdef _WIN32
    system("Pause");
#endif
    return 0;
}

Example Output

$ ./bin/accent_read
Inicio
D1
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff
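
Since std::string stores bytes rather than characters, indexing it (e.g. text[5]) yields a single byte of a multi-byte character, not the whole 'ó'. The following is a rough sketch of how a UTF-8 string could be split into one substring per character. The helper names `utf8_length` and `split_utf8` are hypothetical, and the sketch assumes well-formed UTF-8 input; it does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Stray continuation bytes and invalid lead bytes are treated as length 1.
std::size_t utf8_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Split a UTF-8 encoded string into one std::string per character.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t n = utf8_length(static_cast<unsigned char>(text[i]));
        if (i + n > text.size()) n = text.size() - i;  // truncated sequence
        chars.push_back(text.substr(i, n));
        i += n;
    }
    return chars;
}
```

With the line "Biatlón" from the data file, split_utf8 returns seven elements, and element 5 holds both bytes of 'ó' (0xC3 0xB3), so it can be printed or compared as a unit.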

Windows 10 Using UTF-8 Codepage

The problem you see when running the above code in the Windows 10 console (which I presume is where DevC++ launches the output) is that the default codepage (437 - OEM United States) does not interpret UTF-8, so each byte of a multi-byte UTF-8 character is displayed as a separate codepage-437 character. To change the codepage to UTF-8, use 65001 - Unicode (UTF-8). See Code Page Identifiers

To get the proper output after compiling under VS with the C++17 language standard, all that was needed was to change the codepage with chcp 65001 in the console. (You must also use a console font that contains the needed glyphs; mine is set to Lucida Console.)

Output In Windows Console (Command Prompt) After Setting Codepage

C:\Users\david\source\repos\accents>chcp 65001
Active code page: 65001

C:\Users\david\source\repos\accents>Debug\accents.exe
Inicio
D1
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

Press any key to continue . . .

Because DevC++ launches the console automatically, you also need to set the codepage programmatically. You can do that using SetConsoleOutputCP (65001). For example:

...
#include <windows.h>
...
#define CP_UTF8 65001 

int main () {

    // setlocale (LC_CTYPE, "C");           /* not needed */
    
    /* set console output codepage to UTF-8 */
    if (!SetConsoleOutputCP(CP_UTF8)) {
        std::cerr << "error: unable to set UTF-8 codepage.\n";
        return 1;
    }
    ...

See SetConsoleOutputCP function. The analogous function for setting the input codepage is SetConsoleCP(UINT wCodePageID).

Output Using SetConsoleOutputCP()

With the console set to the default 437 codepage and the program calling SetConsoleOutputCP (65001) to switch the output codepage to UTF-8, you get the same result, e.g.

C:\Users\david\source\repos\accents>chcp 437
Active code page: 437

C:\Users\david\source\repos\accents>Debug\accents.exe
Inicio
D1
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

Press any key to continue . . .

Also, check the DevC++ project (or program) settings and check whether you can set the output codepage there. (I don't use it, so don't know if it is possible).

edited Apr 08 '22 at 03:40

answered Apr 07 '22 at 03:09

David C. Rankin

How would you know the encoding? – Passer By Apr 07 '22 at 03:55
@PasserBy you bring up a good point. If the encoding isn't UTF-16 (regardless of byte order mark), or other wide-character type, you would be okay. If it were, types would need to change accordingly. – David C. Rankin Apr 07 '22 at 06:31
The code-page changes needed for the Windows terminal are described in [Output unicode strings in Windows console app](https://stackoverflow.com/q/2492077/3422102). You can check/set the terminal code page with `chcp`. The default terminal uses code page `437`; for UTF-8 the code page is `65001`. gcc/MinGW will output the accent on either if `SetConsoleCP(CP_UTF8)` and `SetConsoleOutputCP(CP_UTF8)` are set, though conversions with `mbstowcs` and the like are still needed for the characters on Windows. – David C. Rankin Apr 07 '22 at 07:57
@David C Rankin I have used your recommendations which I appreciate very much but I keep getting the same problem, instead of printing to the console 'ó' I am getting '├│' even compiling and pasting your code in its entirety. – ramej Apr 07 '22 at 16:57
@ramej Which version of Windows are you using and which terminal (Command Prompt or PowerShell)? The Windows console's native code page doesn't support accented characters by default. See [Windows Command-Line: Unicode and UTF-8 Output Text Buffer](https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/). The other link I posted (two comments above) contains the background on how to implement the code-page change and set the input and output stream properties for UTF-8 in the Windows console. – David C. Rankin Apr 07 '22 at 19:12
@David C Rankin I use Windows 10 with what I suppose is the default console; I don't know how to tell which one I am using and would appreciate knowing how. Still, I don't think the console itself is the problem, since with 'setlocale (LC_CTYPE, "");' I have read 'ó' through cin>> and printed it on the console without problems. One new thing: I created the same .txt from Visual Studio Code, and when I read that file I do get the 'ó'. The problem persists when the .txt is created from Windows Notepad, which is exactly what I need to read. – ramej Apr 07 '22 at 20:42
The `setlocale (LC_CTYPE, "") ;` likely isn't needed. I'm on Linux and have a few more hours work to do, I'll have time after that to boot windows. I suspect the dev C++ you are using just uses the native command prompt for terminal output. This is a "windows" problem, not a code problem. The new tabbed console from the windows store evidently handles UTF8 and multibyte characters natively (I haven't tried it). I'll write a windows conversion later tonight and drop an update. – David C. Rankin Apr 07 '22 at 22:13
@David C Rankin My doubt that it is a windows problem is because, as I said before, for a console printout of ' ó ' there is no problem, the problem occurs when the .txt is read. In the same way I appreciate your good help and I will wait for the update in addition to reading other articles. – ramej Apr 07 '22 at 23:08
@ramej - it is exactly as I specified related to the console codepage -- see my update. I just booted Windows 10 and added the additions above (and removed `setlocale()`) I tested (1) by manually setting the console codepage to UTF-8, and (2) by leaving the console codepage at `437` and using `SetConsoleOutputCP()` within the program to set it. Both worked fine. – David C. Rankin Apr 08 '22 at 03:22
Also, I built under MinGW gcc 6.3. Worked exactly the same. I just build from the command line and put the executable in a `bin/` subdirectory to keep the source directory clean. I used `g++ -std=c++14 -O3 -o bin/accent_read_vs accent_read_vs.cpp` and then running `bin\accent_read_vs.exe` gave the same output posted in the update. `chcp` confirmed `"Active code page: 437"`, so setting UTF-8 within the program worked with MinGW-gcc as well. – David C. Rankin Apr 08 '22 at 04:09
If you saved your data file from Notepad, make sure you told it to save as UTF-8 (no BOM). Depending on the Windows version and the option chosen, Notepad may instead save as ANSI, as UTF-8 with BOM, or as UTF-16LE with BOM ("Unicode"), and then additional conversions will be required. (Sorry, that is another possibility on Windows.) This is why character sets, codepages and character encodings cause such grief under Windows. – David C. Rankin Apr 08 '22 at 04:19
@David C Rankin I was just about to comment that I realized the problem was that I was trying to read a .txt in UTF-8 format. The solution you published has helped me and I can now print 'ó' on the screen without any problem. I really hope to have your knowledge at some point and be able to help others. One last thing: how can I go through the string and get the 'ó' character separately? When I use text[5] for "Biatlón" it prints either a blank space or the character '├'. Thanks for all your help. – ramej Apr 08 '22 at 22:18
@David C Rankin In fact as I said with the variable text if it prints the ' ó ' but if I put text [5] it prints either blank space or an integer 195 which I think corresponds to '├ ' The other characters are printed without any problem, that is text[0] 'B', text[4] 'l', and so on. – ramej Apr 08 '22 at 22:23
I have been working through this problem from several angles. There isn't a one-size-fits-all solution (especially with DevC++, which tends to use older internals). The way multibyte/wide characters are handled on Windows has changed from Win7/VC10 through what we have today, so exactly how you need to code it will depend. I ran across the following link that also addresses the issue and has two more approaches: [Reading UTF-8 characters from console](https://stackoverflow.com/q/48176431/3422102). Handling all cases requires C conversion with `mbstowcs_s()` and back. – David C. Rankin Apr 08 '22 at 22:30
@David C Rankin Well, friend, I'll get on it and review what you've sent me. I appreciate all your help, and I hope you have a good day. – ramej Apr 08 '22 at 23:38
Good to hear you are getting closer. In the `reading UTF-8` link in the last comment above, pay attention to the `MS_STDLIB_BUGS` define. That applies to earlier versions of VC, MinGW and Clang++, so those implementations are just flat buggy in how they handle the character set. The answer in that link by @Davistor addresses some, but not all, of the buggy implementations. His solution doesn't help with the MinGW 6.3 or Win7SDK (VS-10) compilers. On Win10, if you add the MinGW\bin directory to your path, you can compile from the command line as I showed in the comment above with `g++ -std=c++14 ...` Try it. – David C. Rankin Apr 09 '22 at 03:32
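
Since the comments raise the question of how a file written by Notepad is encoded, a small check for a byte order mark (BOM) at the start of the file can help decide whether conversion is needed. This is an illustrative sketch only: the `Bom` enum and the `detect_bom` function are invented names, and a file without a BOM could still be in any encoding. A UTF-32LE BOM starts with the same two bytes as UTF-16LE and is not distinguished here.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file's contents for a byte order mark.
// `head` should hold at least the first three bytes of the file.
Bom detect_bom(const std::string& head) {
    auto byte = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };
    if (head.size() >= 3 &&
        byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF) {
        return Bom::Utf8;      // EF BB BF
    }
    if (head.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE) {
        return Bom::Utf16LE;   // FF FE
    }
    if (head.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF) {
        return Bom::Utf16BE;   // FE FF
    }
    return Bom::None;
}
```

To use it, open the file with std::ios::binary, read the first three bytes into a std::string, and pass that string to detect_bom. A result of Bom::Utf8 means the three BOM bytes should be skipped before printing; a UTF-16 result means the byte-oriented std::string approach in the answer will not work without conversion.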
