I was looking into Matt Parker's problem of finding 5 5-letters words that have in total 25 distinct letters (link if you are interested, not really relevant to the issue), but I wanted to do it using french words (and so I had to use the french alphabet).
So I downloaded the following file from github, containing 1 french word per line, encoded in utf-8, with the accents. But something I never seen before occured : There seems to be two ways (maybe more) to encode accents in utf-8.
When I opened the file in VSCode (also works with Windows' Notepad), every accent displayed as they should, but :
- when I select a letter with an accent, it shows that 2 letters are selected
- when I write a letter with an accent using a french keyboard, and then select it, it shows that only 1 letter is selected (so the two appear to be distinct, though they display exactly identically)
I then tried the following code in python :
# first is typed from keyboard, second is typed from keyboard, returns true
print("é" == "é")
# first is copied from downloaded text file, second is typed from keyboard, returns false
print("é" == "é")
It displays identically, but is not identical !?
I then tried to change the encoding of the file, but it either changed nothing, or removed the accents : the word "abaissé" might for example show as "abaisse�".
After testing, it seems like the file is encoding accented letters as a special accent character adjacent to the letter character it wants to put the accent on. But when I then read the file character by character (it does it in rust for example, but I'm pretty sure it does the same in other languages), they come as two characters : I first get the accent and then the letter. And rust chars are by default valid unicode units, so if they where 1 single unicode char, rust should read them as 1 single unicode char. It's not the case.
I'm honestly stuck, so if you know, I would be really grateful for your help.
Why are these combos of characters showing as 1 character when they should display as 2 distinct ? I know this is not a bug from utf-8, but it makes handling the data very hard. So how do I recombine them together, as the accent shouldn't be decoupled from the letter in the first place ?