
I need to map far-out Unicode Latin substitutes to their respective base letters.

    — should match 4× “Overflow”

These examples from Mathematical Alphanumeric Symbols already cover a considerable number of variants, but there are also the Enclosed Alphanumeric Supplement, Halfwidth and Fullwidth Forms, and many more blocks, so there’s no point in writing up a mapping manually.

It’s not about diacritics (as in questions like this or this, more like this). I also don’t want to merely test for occurrence, so even if it were supported, \p{Symbol} wouldn’t help.

If I am looking for U+0041 LATIN CAPITAL LETTER A, I also want to match U+1D49C MATHEMATICAL SCRIPT CAPITAL A, U+FF21 FULLWIDTH LATIN CAPITAL LETTER A and, say, every other derivative with “[LATIN] LETTER A” in its name.

Do Unicode code points carry some attribute that points to the base character they were derived from, and is there a way to evaluate it programmatically (T-SQL, .NET)?

dakab
  • What language are you looking for a solution in? (Though as long as it supports unicode normalization, you can easily adapt the python answer already given...) – Shawn Nov 17 '20 at 12:04
  • Compatibility Normalization Form links the character to the "real" one. Note: strings should be formatted by changing the font, not by changing Unicode characters. It may be OK for lettering (logos and similar uses), but I recommend against doing it programmatically. – Giacomo Catenazzi Nov 17 '20 at 13:52
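
The compatibility normalization mentioned in the comments could be sketched roughly as follows. This is a minimal Python sketch, not a T-SQL or .NET solution; the function names `fold_char` and `fold_text` and the name-based fallback are assumptions made for illustration. The fallback is there because some lookalikes (for example the negative circled letters in the Enclosed Alphanumeric Supplement) appear to have no compatibility mapping, so NFKC alone leaves them unchanged.

```python
import re
import unicodedata

# Hypothetical helper pattern: matches character names that END in
# "LATIN CAPITAL LETTER X" or "LATIN SMALL LETTER X", so accented
# letters ("... LETTER E WITH ACUTE") are deliberately not matched.
_LATIN_NAME = re.compile(r"LATIN (CAPITAL|SMALL) LETTER ([A-Z])$")


def fold_char(ch: str) -> str:
    """Map one character to its base form (illustrative, not exhaustive)."""
    # Step 1: compatibility normalization (NFKC) resolves <font>, <wide>,
    # <circle> and similar compatibility decompositions.
    folded = unicodedata.normalize("NFKC", ch)
    if folded != ch:
        return folded
    # Step 2 (assumption): fall back to the character name for code points
    # that carry no compatibility mapping.
    match = _LATIN_NAME.search(unicodedata.name(ch, ""))
    if match:
        case, letter = match.groups()
        return letter if case == "CAPITAL" else letter.lower()
    return ch


def fold_text(text: str) -> str:
    """Apply fold_char to every character of the text."""
    return "".join(fold_char(ch) for ch in text)


if __name__ == "__main__":
    fullwidth = "\uFF2F\uFF56\uFF45\uFF52\uFF46\uFF4C\uFF4F\uFF57"
    print(fold_text(fullwidth))
    print(fold_text("\U0001D49C"), fold_text("\U0001F150"))
```

Matching would then compare `fold_text(haystack)` against the plain search term. In .NET, `String.Normalize(NormalizationForm.FormKC)` should cover the first step; the name-based step has no direct built-in equivalent that I know of.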
