You should not attempt to do this yourself using simple loops, especially if the code is security-sensitive. Unicode often allows more than one representation of the same character: it may be a single precomposed code point, or a base character followed by one or more combining code points. For example:
const wchar_t text1[] = { L'e', 0x0301, 0 };
const wchar_t text2[] = { 0x0e9, 0 };
These two strings look identical when printed (they both show é), but they are different sequences of code units, so a simple == check will fail. You should normalize the strings before searching, or use an existing function that automatically does that for you.
Windows provides NormalizeString and FindStringOrdinal, and ICU provides unorm_compare and usearch_first for this purpose.
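To see the problem in isolation, here is a minimal sketch in portable standard C++ that uses no normalization at all. The helper names (decomposed_e_acute, precomposed_e_acute, naive_equal, naive_contains) are hypothetical and exist only for this illustration; they are not part of Windows or ICU.

```cpp
#include <string>

// Decomposed form: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT.
// (Hypothetical helper, for illustration only.)
inline std::u16string decomposed_e_acute() {
    return std::u16string{ u'e', char16_t(0x0301) };
}

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
// (Hypothetical helper, for illustration only.)
inline std::u16string precomposed_e_acute() {
    return std::u16string{ char16_t(0x00E9) };
}

// Naive equality: compares code units one by one, with no normalization.
inline bool naive_equal(const std::u16string& a, const std::u16string& b) {
    return a == b;
}

// Naive search: looks for an exact code-unit subsequence, again with no
// normalization, which is what a hand-written loop would typically do.
inline bool naive_contains(const std::u16string& haystack,
                           const std::u16string& needle) {
    return haystack.find(needle) != std::u16string::npos;
}
```

Calling naive_equal(decomposed_e_acute(), precomposed_e_acute()) and naive_contains(decomposed_e_acute(), precomposed_e_acute()) both yield false, even though the two strings render as the same character. That gap is what normalization closes.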
const wchar_t text1[] = { L'e', 0x0301, 0 };
const wchar_t text2[] = { 0x0e9, 0 };
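// Using Windows APIs, try to normalize the string first.
// With a null buffer, NormalizeString returns an estimated buffer size;
// it returns 0 or a negative value on failure.
int size = NormalizeString(NormalizationKC, text1, -1, nullptr, 0);
if (size <= 0)
    throw std::exception("Can't normalize");
auto text3 = std::make_unique<wchar_t[]>(size);
if (NormalizeString(NormalizationKC, text1, -1, text3.get(), size) <= 0)
    throw std::exception("Can't normalize");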
// Print out the three strings - they all look the same
std::wcout << text1 << std::endl;
std::wcout << text2 << std::endl;
std::wcout << text3.get() << std::endl;
// Verify if they are (or are not) equal
if (CompareStringOrdinal(text1, -1, text2, -1, false) == CSTR_EQUAL)
    std::wcout << L"Original strings are equivalent\r\n";
else
    std::wcout << L"Original strings are not equivalent\r\n";
if (CompareStringOrdinal(text3.get(), -1, text2, -1, false) == CSTR_EQUAL)
    std::wcout << L"Normalized strings are equivalent\r\n";
else
    std::wcout << L"Normalized strings are not equivalent\r\n";
// Verify if the string text2 can be found
if (FindStringOrdinal(FIND_FROMSTART, text1, -1, text2, -1, TRUE) != -1)
    std::wcout << L"Original string contains the searched-for string\r\n";
else
    std::wcout << L"Original string does not contain the searched-for string\r\n";
if (FindStringOrdinal(FIND_FROMSTART, text3.get(), -1, text2, -1, TRUE) != -1)
    std::wcout << L"Normalized string contains the searched-for string\r\n";
else
    std::wcout << L"Normalized string does not contain the searched-for string\r\n";
// Using ICU APIs, try to compare the normalized strings in one go
// (You can also manually normalize, like Windows, if you want to keep the
// normalized form around)
UErrorCode error{ U_ZERO_ERROR };
// The casts assume wchar_t is 16 bits wide, as it is on Windows.
auto result = unorm_compare(reinterpret_cast<const UChar*>(text1), -1,
    reinterpret_cast<const UChar*>(text2), -1, 0, &error);
if (!U_SUCCESS(error))
    throw std::exception("Can't compare");
if (result == 0)
    std::wcout << L"[ICU] Normalized strings are equivalent\r\n";
else
    std::wcout << L"[ICU] Normalized strings are NOT equivalent\r\n";
// Try searching; ICU handles the equivalency of (non-)normalized
// characters automatically.
auto search = usearch_open(reinterpret_cast<const UChar*>(text2), -1,
    reinterpret_cast<const UChar*>(text1), -1, "", nullptr, &error);
if (!U_SUCCESS(error))
    throw std::exception("Can't open search");
auto index = usearch_first(search, &error);
if (!U_SUCCESS(error))
{
    usearch_close(search); // don't leak the search handle on failure
    throw std::exception("Can't search");
}
if (index != USEARCH_DONE)
    std::wcout << L"[ICU] Original string contains the searched-for string\r\n";
else
    std::wcout << L"[ICU] Original string does not contain the searched-for string\r\n";
usearch_close(search);
This produces the following output:
é
é
é
Original strings are not equivalent
Normalized strings are equivalent
Original string does not contain the searched-for string
Normalized string contains the searched-for string
[ICU] Normalized strings are equivalent
[ICU] Original string contains the searched-for string