3

I have to write a small program that removes accents from a string given as input. I wrote a function that replaces a single accented character with its unaccented counterpart, and a loop in my main that calls this function for each character:

char func (char c)
{
    string acc = "èé";
    string norm = "ee";
    char ret = c;

    for(int i = 0; i < acc.size(); i++)
    {
        if(c == acc[i])
            ret = acc[i];
    }
    return ret;
}

The problem is that if I pass the string "é" as input in main, it is seen as a string of size 2 (see the example below), so the function above is called twice instead of once. Moreover, the char passed to the function is not the correct one. I guess I have the same size problem inside my function. Shouldn't this accented letter be seen as a single character? (I am using UTF-8.)

string s = "e";
cout << "size:" << s.size() << endl;
s = "è";
cout << "size:" << s.size() << endl;

OUTPUT
size:1
size:2

I have solved the problem using the wchar_t and wstring types, but I need to insert this function into a more complex program, and I would like to avoid changing all the code to deal with wstring if possible.

Do I need to change the file encoding? The current one is:

text/x-c; charset=utf-8

Is it possible to write such a function using normal strings and chars?

Micha Wiedenmann
Nadir
  • Use `wstring` and `wchar_t`. – iBug Apr 09 '18 at 09:50
  • `è` is not an ASCII character and therefore is not represented with a single byte in UTF-8. Also, `std::string` is not capable of dealing with UTF-8 text: subscripting and other size-related functions won't work properly. If you want to store `è` in `std::string`, then you should probably use windows-1252 or another single-byte encoding. – user7860670 Apr 09 '18 at 09:52
  • Advice to use a single-byte encoding in 2k18? – RiaD Apr 09 '18 at 09:55
  • This may help you converting between 8 and 16 bit string formats [poss duplicate](https://stackoverflow.com/questions/7232710/convert-between-string-u16string-u32string) – Gem Taylor Apr 09 '18 at 09:58
  • Is ISO-8859-1 single byte encoding? How can I change the encoding? Is it the file encoding or do I need to change the code? Thanks – Nadir Apr 09 '18 at 10:09
  • Possible duplicate of [std::wstring VS std::string](https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring) – YSC Apr 09 '18 at 10:13
  • The linked Q&A is high quality (500+ score) – YSC Apr 09 '18 at 10:14

2 Answers

4

You should not attempt to do this yourself using simple loops, especially if the code is security-sensitive. There is often more than one way to represent the same character in Unicode, so you might have a single codepoint or you might have two codepoints. For example:

const wchar_t text1[] = { L'e', 0x0301, 0 };
const wchar_t text2[] = { 0x0e9, 0 };

These two strings appear equivalent when printed (they both show é), but they are obviously not the same and a simple == check will fail. You should normalize the strings before searching, or use an existing function that automatically does that for you.

Windows provides NormalizeString and FindStringOrdinal and ICU provides unorm_compare or usearch_first for this purpose.

const wchar_t text1[] = { L'e', 0x0301, 0 };
const wchar_t text2[] = { 0x0e9, 0 };

// Using Windows APIs, try to normalize the string first
int size = NormalizeString(NormalizationKC, text1, -1, nullptr, 0);
if (size == 0)
  throw std::exception("Can't normalize");

auto text3 = std::make_unique<wchar_t[]>(size);
NormalizeString(NormalizationKC, text1, -1, text3.get(), size);

// Print out the three strings - they all look the same
std::wcout << text1 << std::endl;
std::wcout << text2 << std::endl;
std::wcout << text3.get() << std::endl;

// Verify if they are (or are not) equal
if (CompareStringOrdinal(text1, -1, text2, -1, false) == CSTR_EQUAL)
  std::wcout << L"Original strings are equivalent\r\n";
else
  std::wcout << L"Original strings are not equivalent\r\n";

if (CompareStringOrdinal(text3.get(), -1, text2, -1, false) == CSTR_EQUAL)
  std::wcout << L"Normalized strings are equivalent\r\n";
else
  std::wcout << L"Normalized strings are not equivalent\r\n";

// Verify if the string text2 can be found
if (FindStringOrdinal(FIND_FROMSTART, text1, -1, text2, -1, TRUE) != -1)
  std::wcout << L"Original string contains the searched-for string\r\n";
else
  std::wcout << L"Original string does not contain the searched-for string\r\n";

if (FindStringOrdinal(FIND_FROMSTART, text3.get(), -1, text2, -1, TRUE) != -1)
  std::wcout << L"Normalized string contains the searched-for string\r\n";
else
  std::wcout << L"Normalized string does not contain the searched-for string\r\n";

// Using ICU APIs, try to compare the normalized strings in one go
// (You can also manually normalize, like Windows, if you want to keep the
// normalized form around)
UErrorCode error{ U_ZERO_ERROR };
auto result = unorm_compare(reinterpret_cast<const UChar*>(text1), -1, 
  reinterpret_cast<const UChar*>(text2), -1, 0, &error);
if (!U_SUCCESS(error))
  throw std::exception("Can't normalize");

if (result == 0)
  std::wcout << L"[ICU] Normalized strings are equivalent\r\n";
else
  std::wcout << L"[ICU] Normalized strings are NOT equivalent\r\n";

// Try searching; ICU handles the equivalency of (non-)normalized
// characters automatically.
auto search = usearch_open(reinterpret_cast<const UChar*>(text2), -1, 
  reinterpret_cast<const UChar*>(text1), -1, "", nullptr, &error);
if (!U_SUCCESS(error))
  throw std::exception("Can't open search");

auto index = usearch_first(search, &error);
if (!U_SUCCESS(error))
  throw std::exception("Can't search");

if (index != USEARCH_DONE)
  std::wcout << L"[ICU] Original string contains the searched-for string\r\n";
else
  std::wcout << L"[ICU] Original string does not contain the searched-for string\r\n";

usearch_close(search);

This produces the following output:

é
é
é
Original strings are not equivalent
Normalized strings are equivalent
Original string does not contain the searched-for string
Normalized string contains the searched-for string
[ICU] Normalized strings are equivalent
[ICU] Original string contains the searched-for string
Peter Torr - MSFT
1

Store the character in a wchar_t, like so:

wchar_t text = L'é';

You can also store special characters in wstring:

wstring text = L"étoile";

If you still need to compare a potential special character in a wchar_t (or wstring) with a char (or string), this thread explains how to do so quite well.

TioneB