How to read Unicode characters (non-English languages) from file so I can reject curse words [C++]?

Question

Goal: Android/iOS app that displays tricky-to-spell words when the user enters a letter.

The code runs correctly, except for non-English characters. The program allows some non-English curse words through when the user puts in their first name.

Expected result: Program will reject all curse words (within nameCurseWordsList.txt)
Actual result: Some non-English curse words are let through.
Error messages: None

Tried:

Changing vector input and the related variable, userFirstName, to wstring.
Checked to see if VS2019 settings were set to "Autodetect UTF-8 encoding without signature" - Already enabled.
Changed encoding of nameCurseWordsList.txt from "UTF-8 with BOM" to "UTF-8".
Set VS2019's cmd output window to use the font Lucida Console.
How to handle unicode character sequences in C/C++?
How can my program switch from ASCII to Unicode?
How to show Unicode characters in Visual Studio?

Code:

// Description: Android/iOS application that takes in a letter and displays tricky-to-spell words.

#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <string>
#include <vector>
#include <limits>
#include <sstream>
#include "usingDeclarations.hpp"

vector <string> trickyWordsVector;
vector <wstring> nameCurseWordsVector;
void printWords(char userLetterInput);
void getTrickyWords();
void getNameCurseWords();

int main() {    
    cout << "----------------<>-----------\n";
    cout << "Welcome to Your TRICKY WORDS Helper!\n";
    cout << "----------------<>-----------\n";

    wstring userFirstName;
    size_t firstNameOnlyAlpha{};
    bool isNameCurseWord = false;

    do {
        cout << "\nEnter your first name: ";
        getline(wcin, userFirstName);
        for (unsigned __int8 i = 1; i < userFirstName.length(); i++) {
            if (userFirstName[i - 1] == ' ') {
                userFirstName[i] = toupper(userFirstName[i]);
            } else {
                userFirstName[i] = tolower(userFirstName[i]);
            }
        }
        userFirstName[0] = toupper(userFirstName[0]);
        firstNameOnlyAlpha = userFirstName.find_first_of(L"0123456789`~!@#$%^&*()':'';''/'-_=+{}[]|:<>,.' '?'\t\"");

        getNameCurseWords();

        if (find(nameCurseWordsVector.begin(), nameCurseWordsVector.end(), userFirstName) != nameCurseWordsVector.end()) {
            cout << "Curse word entered. Please freakin' try again.\n";
            isNameCurseWord = true; 
        }   else {
            isNameCurseWord = false;
        }
    } while (isNameCurseWord || firstNameOnlyAlpha != string::npos || userFirstName.empty());

    char userLetterInput; 
    char userChoiceContinue;

    do {
        do {
            cout << "\nEnter a letter [a-z]: ";
            cin >> userLetterInput;
            cin.ignore(numeric_limits<streamsize>::max(), '\n');
            userLetterInput = toupper(userLetterInput);

            if (isalpha(userLetterInput)) {
                wcout << "\nHey " << userFirstName << ",\n\nHere's your list of tricky words for the letter (" << char(toupper(userLetterInput)) << "):\n" << endl;
            }
        } while (!isalpha(userLetterInput));

        getTrickyWords();

        printWords(userLetterInput);

        do {
            cout << "\nWould you like to enter another letter [y,n]?: ";
            cin >> userChoiceContinue;
            cin.ignore(numeric_limits<streamsize>::max(), '\n');
        } while (char(tolower(userChoiceContinue)) != 'y' && char(tolower(userChoiceContinue)) != 'n');
    } while (char(tolower(userChoiceContinue)) == 'y');

    cout << "\n----------------<>-----------\n";
    cout << "Thank you for using Your TRICKY WORDS Helper!\n";
    cout << "\n----------------<>-----------\n";

    return 0;
} // end main()

void printWords(char userLetterInput) {
    for (int i = 0; i < trickyWordsVector.size(); i++) {
        if (trickyWordsVector[i][0] == userLetterInput) {
            cout << trickyWordsVector[i];
            cout << "\n";
        }
    }
} // end printWords()

void getTrickyWords() {
    ifstream trickyWordsFile("trickyWordsList.txt");

    if (trickyWordsFile.is_open()) {
        if (trickyWordsVector.empty()) {
            string line;
            while (getline(trickyWordsFile, line)) {
                if (line.size() > 0) {
                    trickyWordsVector.push_back(line);
                }
            }
        }
    }
    else {
        cerr << "Cannot open the file.";
    }

    trickyWordsFile.close();

} // end getTrickyWords()

void getNameCurseWords() {
    wifstream nameCurseWordsFile("nameCurseWordsList.txt");

    if (nameCurseWordsFile.is_open()) {
        if (nameCurseWordsVector.empty()) {  
            wstring line;
            while (getline(nameCurseWordsFile, line)) {
                if (line.size() > 0) {
                    nameCurseWordsVector.push_back(line);
                }
            }
        }
    }
    else {
        cerr << "Cannot open the file, you sailor mouth. ;)";
    }

    nameCurseWordsFile.close();

} // end getNameCurseWords()

usingDeclarations.hpp

#pragma once

using std::cout;
using std::wcout;
using std::cin;
using std::wcin;
using std::cerr;

using std::getline;
using std::endl;
using std::use_facet;

using std::numeric_limits;
using std::streamsize;
using std::string;
using std::wstring;
using std::ifstream;
using std::wifstream;
using std::vector;
using std::locale;
using std::ctype;

trickyWordsList.txt or trickyWordsFile

Argument 
Atheist 
Axle 
Bellwether 
Broccoli 
Bureau 
Caribbean 
Calendar
Camaraderie 
Desiccate 
Desperate 
Deterrence

nameCurseWordsList.txt or nameCurseWordsFile (partial list)

// Irish
RáIcleach
// German
ScheißKopf
// Russian
Oбосра́ться
Obosrat'sya
// Chinese
王八蛋
Hùn Zhàng
// Japanese
くそ 
// Korean
아, 씨발

Thanks for any advice.

Run code using https://repl.it/~

"Actual result: Some non-English curse words are let through" Define what "some" means, in this case. Specific examples. If this is a case of same visual graphemes from different character sets, your only realistic option is to compile the list of equivalent graphemes, and make the appropriate changes to your logic. C++ will not do this for you, you will need to do this work yourself. Maybe some Unicode library provides this. — Sam Varshavchik, Apr 12 '20 at 14:03
P.S. tolower/toupper are for `char`s, not wide chars. So that's one bug that you need to fix, in any case. — Sam Varshavchik, Apr 12 '20 at 14:06
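To illustrate the point about wide characters, here is a minimal sketch of the name-capitalization loop using `std::towupper` and `std::towlower` from `<cwctype>`. The helper name `titleCase` is hypothetical and not part of the original program. How non-ASCII letters are converted depends on the active C locale; in the default "C" locale they may be left unchanged.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: upper-cases the first letter of each space-separated
// word and lower-cases the rest, using the wide-character functions.
// Conversion of non-ASCII letters depends on the current C locale.
std::wstring titleCase(std::wstring name) {
    bool startOfWord = true;
    for (wchar_t& ch : name) {
        ch = static_cast<wchar_t>(startOfWord ? std::towupper(ch)
                                              : std::towlower(ch));
        startOfWord = (ch == L' ');
    }
    return name;
}
```

For example, `titleCase(L"mARY ann")` yields `L"Mary Ann"`. The same normalization would need to be applied to both the user's input and the entries read from the file for an exact comparison to be meaningful.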
@SamVarshavchik Thanks for the feedback. I'll fix the bug related to toupper and tolower. For the cursewords, I put some examples in the code. It's struggling to recognize accents on letters and German letters. The only reason the code is rejecting the curse word is because it doesn't recognize the characters, lol. RáIcleach - Irish á | ScheißKopf - German ß | くそ - Japanese, both characters — Electric Egghead, Apr 12 '20 at 20:32
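On the earlier point about equivalent graphemes: an accented letter can be stored either as one precomposed code point (á is U+00E1) or as a base letter followed by a combining mark (a plus U+0301), and the two forms compare unequal. The sketch below is a toy, with a hypothetical helper `composeAcute` that handles only three letters; full normalization would need a Unicode library such as ICU.

```cpp
#include <string>

// Toy normalizer (hypothetical helper): merges a base letter followed by
// U+0301 COMBINING ACUTE ACCENT into its precomposed form.
// Only a, e and o are handled; everything else is copied through unchanged.
std::u32string composeAcute(const std::u32string& in) {
    constexpr char32_t kCombiningAcute = 0x0301;
    std::u32string out;
    for (char32_t c : in) {
        if (c == kCombiningAcute && !out.empty()) {
            switch (out.back()) {
                case U'a': out.back() = 0x00E1; continue;  // á
                case U'e': out.back() = 0x00E9; continue;  // é
                case U'o': out.back() = 0x00F3; continue;  // ó
                default: break;  // no mapping: keep the combining mark
            }
        }
        out.push_back(c);
    }
    return out;
}
```

With this, `composeAcute(U"a\u0301")` equals `U"\u00E1"`, whereas the two strings differ before composition.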
@stark Lol. I don't think so. I need to approach the battle differently. — Electric Egghead, Apr 12 '20 at 20:38
Again: what does "struggling to recognize accents" mean? This is like telling your mechanic "my car is struggling to move forward", and expecting a useful diagnosis based on that. There's nothing very complicated about comparing text strings in C++. It's one of the simplest tasks that can be done. Either two strings are the same, or they're not. No other possibilities. You have the entire program and all the data available to you, and you are the only one on stackoverflow.com in this position, so only you can debug the code. Did you try using a debugger already, what did you see? — Sam Varshavchik, Apr 12 '20 at 20:40
@SamVarshavchik I used Visual Studio 2019's debugger. The German ß shows as an accented a, and the curse word is accepted. I'll work on it more. Thanks for the reply! :) — Electric Egghead, Apr 18 '20 at 12:24
This sounds like an encoding issue. There are several different ways to encode non-English characters. There's the multi-byte Unicode encoding, UTF-8, where a non-ASCII character is encoded as two or more bytes; there's a family of ISO-8859-based encodings, where the high-bit bytes are mapped to different characters, with the same bytes being used for different characters between ISO-8859 variations. This is something that you'll need to figure out by looking at the raw bytes. — Sam Varshavchik, Apr 18 '20 at 16:04
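A minimal sketch of how the raw bytes could be inspected and then decoded explicitly, assuming the file really is UTF-8. The helper names `readAllBytes`, `toHex` and `decodeUtf8` are hypothetical. In UTF-8, ß is the two bytes C3 9F; if each byte is widened to its own character, C3 becomes Ã, which would be consistent with the "accented a" seen in the debugger.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Reads a file as raw bytes, with no text-mode or locale conversion.
std::string readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Formats bytes as space-separated upper-case hex, e.g. "C3 9F".
std::string toHex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X ", b);
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}

// Decodes UTF-8 into code points. Sketch only: malformed or truncated
// sequences become U+FFFD; overlong forms and surrogates are not rejected.
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0xFFFD;   // default for an invalid lead byte
        std::size_t extra = 0;  // continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0; --extra, ++i) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // missing or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

One possible use is to call `toHex(readAllBytes("nameCurseWordsList.txt"))` to see what is actually stored, then compare `std::u32string` values produced by `decodeUtf8` for both the file entries and the user's input, provided the input is also available as UTF-8 bytes.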
