How to map exotic Unicode Latin alphabets to their base letters?

Question

I need to map far-out Unicode Latin substitutes to their respective base letters.

    — should match 4× “Overflow”

These examples from Mathematical Alphanumeric Symbols already comprise a considerable number of variants, but there’s also the Enclosed Alphanumeric Supplement, Halfwidth and Fullwidth Forms of course, and a great many more, so there’s no point in writing up a mapping manually.

It’s not about diacritics (as in questions like this or this, more like this). I also don’t want to merely test for occurrence, so even with support, \p{Symbol} wouldn’t help.

If I were looking for U+0041 LATIN CAPITAL LETTER A, I would also want to match U+1D49C MATHEMATICAL SCRIPT CAPITAL A, U+FF21 FULLWIDTH LATIN CAPITAL LETTER A and, say, every other derivative with “[LATIN] LETTER A” in its name.

Is there some sort of attribute on Unicode code points that points to a possible base character they were derived from, and a way to evaluate it programmatically (T-SQL, .NET)?
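
For illustration, the Unicode Character Database does carry such a per-code-point property: the decomposition mapping. The following is a minimal sketch that reads it, assuming Python's standard `unicodedata` module as a stand-in for the T-SQL/.NET target; the helper name `base_of` is hypothetical, and it resolves only one level of decomposition.

```python
import unicodedata


def base_of(ch: str) -> str:
    """Return the one-level decomposition target of a single character.

    Hypothetical helper: characters without a decomposition mapping are
    returned unchanged.
    """
    mapping = unicodedata.decomposition(ch)  # e.g. "<font> 0041", or "" if none
    if not mapping:
        return ch
    # Drop the formatting tag (<font>, <wide>, <circle>, ...) and keep the code points.
    code_points = [part for part in mapping.split() if not part.startswith("<")]
    return "".join(chr(int(cp, 16)) for cp in code_points)


for sample in ("\U0001D49C", "\uFF21", "A"):
    print(f"U+{ord(sample):04X} {unicodedata.name(sample)} -> {base_of(sample)!r}")
```

Characters with a canonical (untagged) mapping, such as precomposed accented letters, come back as base letter plus combining mark, so this sketch does not strip diacritics.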

What language are you looking for a solution in? (Though as long as it supports unicode normalization, you can easily adapt the python answer already given...) — Shawn, Nov 17 '20 at 12:04
Compatibility Normalization Form links the character to the "real" one. Note: strings should be formatted by changing fonts, not by changing Unicode characters. Doing so may be OK for lettering (logos and similar uses), but I recommend against doing it programmatically. — Giacomo Catenazzi, Nov 17 '20 at 13:52
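
The compatibility normalization mentioned in the comments can be sketched as follows. This is a minimal illustration in Python, assuming NFKC is an acceptable folding for the search; the function names `fold_to_base` and `count_matches` are hypothetical, and code points that have no compatibility mapping are left unchanged, so coverage should be checked against the character sets you care about.

```python
import unicodedata


def fold_to_base(text: str) -> str:
    """Apply compatibility normalization (NFKC) to fold styled letters."""
    return unicodedata.normalize("NFKC", text)


def count_matches(haystack: str, needle: str) -> int:
    """Count occurrences of needle in haystack after folding both sides."""
    return fold_to_base(haystack).count(fold_to_base(needle))


# "Overflow" written with Halfwidth and Fullwidth Forms code points.
fullwidth = "\uFF2F\uFF56\uFF45\uFF52\uFF46\uFF4C\uFF4F\uFF57"
print(fold_to_base(fullwidth))
print(count_matches(f"Stack {fullwidth} and Overflow", "Overflow"))
```

In .NET the same folding is available through `String.Normalize(NormalizationForm.FormKC)`. NFKC also rewrites other compatibility characters (ligatures, for example), so the folded string is best used only for matching, not stored in place of the original.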

