Same text in UTF-8 but different in ANSI

Question

I am using Notepad++ with the encoding set to UTF-8. The file contains two lines of text:

Thành
Thành

But when I use the Find dialog to search for "Thành", it finds only one match. When I change the Notepad++ encoding to ANSI, it shows:

ThÃ nh
ThaÌ€nh

Why are they different in ANSI? What should I do to make them the same?

"ANSI" is not well-defined in this context; Microsoft has in principle agreed to retire it because it's a misnomer and a complication, but they seem to be in no hurry. By the looks of it, what Notepad+? call ANSI is actually code page 1252 in this instance. — tripleee, Apr 10 '23 at 19:23

score 1 · Answer 1 · answered Apr 10 '23 at 18:38

Your strings differ in their Unicode normalization form (shown here for the relevant characters only):

Form   String Unicode                        Length
----   ------ -------                        ------
(raw)  à à    \u00e0 \u0061\u0300            4
FormC  à à    \u00e0 \u00e0                  3
FormD  à à    \u0061\u0300 \u0061\u0300      5
FormKC à à    \u00e0 \u00e0                  3
FormKD à à    \u0061\u0300 \u0061\u0300      5
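The table above can be reproduced with Python's standard `unicodedata` module. This is only a minimal sketch: the variable names and the `escape` helper are illustrative, and the letters are written as escapes so that copy and paste cannot silently change their form. Python's names NFC, NFD, NFKC and NFKD correspond to FormC, FormD, FormKC and FormKD in the table.

```python
import unicodedata

# The two variants of the accented letter, written with escapes so that
# copy and paste cannot silently normalize them.
composed = "\u00e0"       # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"    # "a" followed by COMBINING GRAVE ACCENT
raw = composed + " " + decomposed


def escape(text):
    """Render every character except the space as a \\uXXXX escape."""
    return "".join(c if c == " " else f"\\u{ord(c):04x}" for c in text)


# Print the raw string first, then each of the four normalization forms.
print(f"{'(raw)':<6} {escape(raw):<30} {len(raw)}")
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, raw)
    print(f"{form:<6} {escape(normalized):<30} {len(normalized)}")
```

The lengths count code points (including the separating space), which is why the composed forms come out shorter than the decomposed ones.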

The former string is

T (U+0054, Latin Capital Letter T)
h (U+0068, Latin Small Letter H)
à (U+00E0, Latin Small Letter A With Grave)
n (U+006E, Latin Small Letter N)
h (U+0068, Latin Small Letter H)

while the latter one is

T (U+0054, Latin Capital Letter T)
h (U+0068, Latin Small Letter H)
a (U+0061, Latin Small Letter A)
̀ (U+0300, Combining Grave Accent)
n (U+006E, Latin Small Letter N)
h (U+0068, Latin Small Letter H)
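A listing like the two above can be produced for any string. The following is a small sketch using `unicodedata.name`; the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in text]


first = "Th\u00e0nh"     # precomposed: 5 code points
second = "Tha\u0300nh"   # decomposed: 6 code points

for label, text in (("first", first), ("second", second)):
    print(label, len(text))
    for entry in describe(text):
        print("  ", entry)
```

Both strings render identically on screen, but the second one has one more code point than the first.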

What you see in ANSI is a case of mojibake: the UTF-8 bytes are being decoded as code page 1252 (example in Python, chosen because it is widely readable):

# Escapes keep the two forms distinct: \u00e0 is precomposed, a\u0300 is decomposed
print('Th\u00e0nh\nTha\u0300nh'.encode('utf-8').decode('cp1252'))

ThÃ nh
ThaÌ€nh
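As for making the two lines the same: one option is to convert the text to a single normalization form, for example NFC, before searching or comparing. The sketch below assumes the file content is available as a Python string; `nfc` and `count_matches` are hypothetical helper names, not part of Notepad++ or of any library.

```python
import unicodedata


def nfc(text):
    """Normalize text to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


def count_matches(haystack, needle):
    """Count occurrences of needle after normalizing both sides to NFC."""
    return nfc(haystack).count(nfc(needle))


text = "Th\u00e0nh\nTha\u0300nh"   # line 1 precomposed, line 2 decomposed
needle = "Th\u00e0nh"

print(text.count(needle))            # plain code point search → 1
print(count_matches(text, needle))   # normalization-aware search → 2

# After normalization both lines have identical UTF-8 bytes, so they also
# look identical when misread as cp1252.
print(nfc(text).encode("utf-8").decode("cp1252"))
```

Saving the normalized text back to the file would give two byte-identical lines, so a plain search such as the Notepad++ Find dialog would then match both.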

If Notepad++ had a proper Unicode implementation, it would deal with different normalizations and find both matches. It's probably not the only software with this limitation though. — Codo, Apr 10 '23 at 19:29
@Codo I agree, and (for an advanced text editor) I'd expect at least something like a `☐ Match Unicode Normalization Forms` check box (similar to and alongside `☐ Match case`) in the **Find** dialogue. Strangely enough, `python -c "print('Thành' == 'Thành')"` returns `False`, whereas `pwsh -nopro -c "& {'Thành' -eq 'Thành'}"` returns `True`. — JosefZ, Apr 10 '23 at 20:10
