
There are multiple Unicode characters that are visually similar, such as:

":" and "꞉"     U+A789
"?" and "？"    U+FF1F
"*" and "⁎"     U+204E
"'", "`", "‘", "’", and "ʻ"

There are also characters with and without diacritical marks, such as:

"c" and "ç"     U+00E7
"E" and "É"     U+00C9
"I" and "İ"     U+0130
"i" and "ı"     U+0131

I'd like to compare text from various sources so that words that are effectively the same compare as equal, such as:

"naive" and "naïve"
"facade" and "façade"
"Hawai'i" and "Hawaiʻi"
"don't" and "don’t"
"letter 'A'" and "letter ‘A’" and "letter `A'"
"letter "B"" and "letter “B”"

Is there a module that provides tests of equivalence between such characters?

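Something like this (the module and function names are hypothetical):

import unicode
if x != y and not unicode.same_character(x, y):
    ...  # treat x and y as genuinely different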
Ray Butterworth
  • http://www.unicode.org/Public/security/latest/confusables.txt – ForceBru Feb 12 '22 at 15:26
  • I disagree that the first snippet shows characters that are "equivalent" or even "visually the same". Some typefaces may use the same glyph for these, but that doesn't mean anything. On my machine (Arch Linux, Firefox, Iosevka), all three pairs even look very different. – ChrisGPT was on strike Feb 12 '22 at 15:39
  • @ForceBru, thanks. If nothing else comes along, I can use that file to write my own function. (It's massive overkill for what I need, but it's probably easier to do all of them than spend time selecting the ones that I *do* need). – Ray Butterworth Feb 12 '22 at 15:39
  • @Chris, for *my* purposes they *are* equivalent. I'm comparing input from multiple people that might enter things slightly differently. Perhaps apostrophe, open-single-quote, and right-single-quote might have been a better example of three equivalent characters. – Ray Butterworth Feb 12 '22 at 15:42
  • @RayButterworth, my point is that in truth they are _not_ equivalent and you'll have to be a lot more specific about what you consider "equivalent" to mean. – ChrisGPT was on strike Feb 12 '22 at 15:44
  • @Chris, as a practical example, I'd like "Hawai'i" and "Hawaiʻi" to be considered the same (but not "Hawaii"). The second one is correctly spelled with an okina, but the first is the much more common usage, spelled with an apostrophe. Similarly "Istanbul" and "İstanbul" should compare as the same. – Ray Butterworth Feb 12 '22 at 15:48
  • You could use the [`unidecode` module](https://pypi.org/project/Unidecode/) to convert each string to its closest ASCII equivalent. – Mark Ransom Feb 12 '22 at 16:48
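
The confusables.txt file linked in the first comment could be turned into a lookup table along these lines. This is only a rough sketch: the two sample lines are illustrative entries written in the file's `source ; target ; type # comment` layout and should be checked against the real file, the function names are hypothetical, and the lookup is a simplified per-character replacement rather than the full skeleton algorithm described in UTS #39.

```python
def parse_confusables(lines):
    """Build a {source_char: target_string} table from confusables-style lines."""
    table = {}
    for line in lines:
        # Drop a leading BOM and everything after '#', which starts a comment.
        data = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue
        source = chr(int(fields[0], 16))
        # The target may be a sequence of several code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def fold_confusables(text, table):
    """Replace each character that has an entry in the table (simplified)."""
    return "".join(table.get(ch, ch) for ch in text)


# Illustrative sample in the file's layout (assumption: verify against the real file).
SAMPLE = [
    "# sample header comment",
    "",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
    "A789 ;\t003A ;\tMA\t# MODIFIER LETTER COLON -> COLON",
]

table = parse_confusables(SAMPLE)
print(fold_confusables("12\ua78930", table) == "12:30")
```

With the real file, `parse_confusables` would be fed the lines of the downloaded text instead of `SAMPLE`. As the asker notes, that table is far larger than this use case needs.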

2 Answers


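Transliterating both strings before comparing them will get you close to that.

Some of the pairs you mention (e.g. "i" and "ı") are not the same character at all, but transliteration might still do what you want.

Keep in mind that transliteration is language specific: the same character can map to different ASCII text depending on the language.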

Try this module: https://pypi.org/project/transliterate/
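
To illustrate why the language matters, here is a small standard-library sketch. It does not use the `transliterate` package; the `LANGUAGE_TABLES` dictionary and both function names are hypothetical and exist only for this example. It applies the German convention of writing "ü" as "ue" when German is requested, and otherwise just strips the diacritic.

```python
import unicodedata

# Hypothetical per-language table; a real one would need many more entries.
LANGUAGE_TABLES = {
    "de": {
        "\u00e4": "ae",  # a with diaeresis
        "\u00f6": "oe",  # o with diaeresis
        "\u00fc": "ue",  # u with diaeresis
        "\u00df": "ss",  # sharp s
    },
}


def strip_marks(text):
    """Decompose the text and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text, language=None):
    """Apply the language's table first, then strip any remaining marks."""
    table = LANGUAGE_TABLES.get(language, {})
    # Compose first so that precomposed keys such as U+00FC match.
    composed = unicodedata.normalize("NFC", text)
    mapped = "".join(table.get(ch, ch) for ch in composed)
    return strip_marks(mapped)


print(transliterate("M\u00fcller", "de"))  # German convention
print(transliterate("M\u00fcller"))        # generic fallback
```

The two calls give different results for the same input, so strings from different languages should not be run through a single language's rules before comparing them.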

Javier

I recommend reading about the Unicode normalization forms (NFC, NFD, NFKC, NFKD): https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
For diacritical marks, see also the question "What is the best way to remove accents (normalize) in a Python unicode string?"

The code below probably does what you want. It is written as a small interactive script that asks which of two comparison functions to use: one ignores diacritical marks, the other keeps them. Pick whichever matches your definition of "equivalent".

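import unicodedata
import unidecode  # third-party package: pip install Unidecode

if input("Ignore diacritical?\t") in ("y", "yes"):
    # NFKD decomposes the string; unidecode then maps it to the closest ASCII text.
    def norm_func(s):
        return unidecode.unidecode(unicodedata.normalize("NFKD", s))
else:
    # NFKD only: canonical and compatibility decompositions, diacritics kept.
    def norm_func(s):
        return unicodedata.normalize("NFKD", s)

def same_character(c1, c2):
    return norm_func(c1) == norm_func(c2)


# Test for the Greek question mark (U+037E) against the semicolon (U+003B).
print(same_character("\x3b", "\u037E"))
print(same_character(";", ";"))
# Test for diacritical marks.
print(same_character("a", "ą"))
# Test for something else.
print(same_character("a", "b"))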
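
For the specific pairs in the question, a standard-library variant might look like the sketch below. It is a rough illustration, not a complete solution: `QUOTE_MAP`, `fold`, and `same_text` are hypothetical names, and the quote table is a hand-picked assumption that covers only the apostrophe-like and quote-like characters mentioned in the question.

```python
import unicodedata

# Hand-picked, hypothetical table: extend it for your own data.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u02bb": "'",   # modifier letter turned comma (okina)
    "`": "'",        # grave accent used as an opening quote
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def fold(text):
    """Decompose with NFKD, drop combining marks, unify quote characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(QUOTE_MAP)


def same_text(a, b):
    return fold(a) == fold(b)


print(same_text("naive", "na\u00efve"))         # diaeresis is dropped
print(same_text("Hawai'i", "Hawai\u02bbi"))     # okina is mapped to an apostrophe
print(same_text("Hawai'i", "Hawaii"))           # the apostrophe still matters
print(same_text("Istanbul", "\u0130stanbul"))   # dotted capital I decomposes to I + dot
print(same_text("i", "\u0131"))                 # dotless i has no decomposition
```

Note that the dotless "ı" (U+0131) has no Unicode decomposition, so this sketch treats it as different from "i"; it would need its own entry in the table if the two should compare as equal.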
Sir