
I have a file open in Notepad++. The encoding is UTF-8, and the file contains two lines of text:

Thành
Thành

But when I use the Find dialog to search for "Thành", I get only 1 result. When I change the Notepad++ encoding to ANSI, it shows:

Thành
Thành

Why are they different in ANSI? What should I do to make them the same?

user2877989
  • "ANSI" is not well-defined in this context; Microsoft has in principle agreed to retire it because it's a misnomer and a complication, but they seem to be in no hurry. By the looks of it, what Notepad++ calls ANSI is actually code page 1252 in this instance. – tripleee Apr 10 '23 at 19:23

1 Answer


Your two strings differ in their Unicode normalization form (shown here for the relevant characters only):

Form   String Unicode                        Length
----   ------ -------                        ------
(raw)  à à    \u00e0 \u0061\u0300            4
FormC  à à    \u00e0 \u00e0                  3
FormD  à à    \u0061\u0300 \u0061\u0300      5
FormKC à à    \u00e0 \u00e0                  3
FormKD à à    \u0061\u0300 \u0061\u0300      5
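
The table above can be reproduced for the full strings with a short sketch. It assumes Python's standard `unicodedata` module, whose form names `NFC`, `NFD`, `NFKC` and `NFKD` correspond to `FormC`, `FormD`, `FormKC` and `FormKD`. The helper `form_lengths` is a hypothetical name used only for illustration.

```python
import unicodedata

# Written with escapes so the two spellings cannot be confused on screen.
composed = "Th\u00e0nh"      # à as a single code point (U+00E0)
decomposed = "Tha\u0300nh"   # a (U+0061) followed by a combining grave accent (U+0300)


def form_lengths(text):
    """Return the length in code points of text under each normalization form."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: len(unicodedata.normalize(form, text)) for form in forms}


# The raw strings compare as different even though they render alike.
print(composed == decomposed)
# After normalization both strings give the same lengths for every form.
print(form_lengths(composed))
print(form_lengths(decomposed))
```

Under the composed forms both strings have 5 code points, and under the decomposed forms both have 6, which is why a plain code-point search treats the raw strings as different.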

The former string is

  • T (U+0054, Latin Capital Letter T)
  • h (U+0068, Latin Small Letter H)
  • à (U+00E0, Latin Small Letter A With Grave)
  • n (U+006E, Latin Small Letter N)
  • h (U+0068, Latin Small Letter H)

while the latter one is

  • T (U+0054, Latin Capital Letter T)
  • h (U+0068, Latin Small Letter H)
  • a (U+0061, Latin Small Letter A)
  • ̀ (U+0300, Combining Grave Accent)
  • n (U+006E, Latin Small Letter N)
  • h (U+0068, Latin Small Letter H)
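
A listing like the two above can be generated rather than typed by hand. This is a minimal sketch assuming Python's standard `unicodedata` module; `describe` is a hypothetical helper name, and Python reports the character names in upper case.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# The decomposed spelling: six code points, including the combining accent.
for line in describe("Tha\u0300nh"):
    print(line)
```

Running the same helper on a string copied out of the editor is a quick way to check which spelling a file actually contains.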

Switching the encoding to ANSI then produces mojibake, because the UTF-8 bytes are reinterpreted as code page 1252 (example in Python, since it is widely understood):

print('Thành\nThành'.encode('utf-8').decode('cp1252'))
Thành
Thành
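
To make the two spellings the same, one option is to normalize the text to a single form (for example NFC) before saving or searching it. The following is only a sketch of that idea outside the editor, assuming Python's standard `unicodedata` module. `count_matches` is a hypothetical helper, not a Notepad++ feature.

```python
import unicodedata


def count_matches(text, needle, form="NFC"):
    """Count occurrences of needle in text after normalizing both to one form."""
    norm_text = unicodedata.normalize(form, text)
    norm_needle = unicodedata.normalize(form, needle)
    return norm_text.count(norm_needle)


# Two lines: the first composed, the second decomposed.
text = "Th\u00e0nh\nTha\u0300nh"

# A plain search only finds the spelling that matches code point for code point.
print(text.count("Th\u00e0nh"))
# A normalization-aware search finds both lines.
print(count_matches(text, "Th\u00e0nh"))

# Normalizing the whole text makes both lines encode to identical UTF-8 bytes.
first, second = unicodedata.normalize("NFC", text).split("\n")
print(first.encode("utf-8") == second.encode("utf-8"))
```

Writing the NFC-normalized text back to the file would leave both lines byte-identical, so an ordinary Find should then match both.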
JosefZ
  • If Notepad++ had a proper Unicode implementation, it would deal with different normalizations and find both matches. It's probably not the only software with this limitation though. – Codo Apr 10 '23 at 19:29
  • @Codo I agree, and (for an advanced text editor) I'd expect at least something like `☐ Match Unicode Normalization Forms` check box (similar to and along with `☐ Match case`) in the **Find** dialogue. Strange enough, `python -c "print('Thành' == 'Thành')"` return `False` while (in contrast to) `pwsh -nopro -c "& {'Thành' -eq 'Thành'}"` -> `True`. – JosefZ Apr 10 '23 at 20:10