
I had no luck finding any package like that, ideally in Python. Is there a library that lets one compare two strings by how visually similar they are?

It would, for instance, be helpful in fighting spam, where someone uses я instead of R, or worse, Α (capital alpha, U+0391) instead of A, to obfuscate a string.

The interface to such a package could be something like

distance("Foo", "Bar")  # large distance
distance("Αяe", "Are")  # small distance

Thanks!

tobast
  • Are you saying that you would like to visualise the results, or you would like the computer to tell the difference between capital alpha and latin 'A' when applying something like an edit distance? – ODP Feb 08 '18 at 08:54
  • @ODP he wants a measure for visual similarity of characters, i.e. a number s(a1,a2) for two strings a1,a2 that tells him how similar they look. I don't think such a package exists, though – Banana Feb 08 '18 at 08:58
  • @ODP hopefully clarified by an edit – tobast Feb 08 '18 at 08:58
  • @ODP anyways, have a look at difflib (in particular sequence matcher). But I think it only helps to solve parts of the problem (telling similarities between same letters with a little different ordering). There are also many distance measures for strings, I just don't think they generalize so easily to other unicode characters or even the visual similarity between them. You will probably have to fix similarity numbers beforehand. – Banana Feb 08 '18 at 09:00
  • Also check https://stackoverflow.com/questions/10433657/how-to-determine-character-similarity and https://security.stackexchange.com/questions/128286/list-of-visually-similar-characters-for-detecting-spoofing-and-social-engineeri/128465 for some helpful references regarding that topic. – Banana Feb 08 '18 at 09:07
  • @Tobast I doubt there would be anything like a function that could do something like this. It would likely have to be done in a low-level language (java might be a good one for this) where you directly encode the pixelated structure of the letters. I think what you're asking is a pretty complex thing. – ODP Feb 08 '18 at 09:09
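For what it's worth, the difflib suggestion from the comments is easy to try, and it also illustrates the limitation raised there: `SequenceMatcher` compares code points, so visually identical homoglyphs count as completely different characters. A minimal sketch:

```python
import difflib

def ratio(a, b):
    # SequenceMatcher.ratio() returns a similarity in [0, 1] based on
    # the longest matching blocks of (exact) code points.
    return difflib.SequenceMatcher(None, a, b).ratio()

print(ratio("Are", "Are"))  # 1.0 -- truly identical strings
print(ratio("Αяe", "Are"))  # ~0.33 -- only 'e' matches, despite looking alike
```

So out of the box, difflib treats "Αяe" and "Are" as mostly dissimilar; some homoglyph-aware preprocessing is needed first.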

2 Answers


I'm not aware of a package that does this. However, you may be able to use tools like the homoglyph attack generator, the Unicode Consortium's confusables data, references from Wikipedia's page on the IDN homograph attack, or other such resources to build your own library of look-alikes and compute a score based on that.
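As a minimal sketch of that idea (the four mappings below are my own illustrative examples, not taken from any of those databases), you can fold known look-alikes onto an ASCII anchor and then measure distance on the folded strings:

```python
# A tiny, hand-picked homoglyph map; a real one would be generated from
# the resources mentioned above.
LOOKALIKES = {
    "Α": "A",  # GREEK CAPITAL LETTER ALPHA
    "а": "a",  # CYRILLIC SMALL LETTER A
    "е": "e",  # CYRILLIC SMALL LETTER IE
    "о": "o",  # CYRILLIC SMALL LETTER O
}

def fold(s):
    """Replace known look-alikes by their ASCII anchor."""
    return "".join(LOOKALIKES.get(c, c) for c in s)

def visual_distance(s1, s2):
    """Hamming-style distance after folding (a real implementation
    would use an edit distance to handle different lengths)."""
    a, b = fold(s1), fold(s2)
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))

print(visual_distance("Foo", "Bar"))  # 3 -- nothing matches
print(visual_distance("Αre", "Are"))  # 0 -- alpha folds to A
```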

EDIT: It looks as though the Unicode folks have compiled a great, big database of characters that look alike. It's available here. If I were you, I'd build a script to read this into a Python dictionary and then parse your string for matches. An excerpt is:

FF4A ;  006A ;  MA  # ( j → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J # 
1D423 ; 006A ;  MA  # (  → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J  # 
1D457 ; 006A ;  MA  # (  → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J  # 
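Each data line of that file has the shape `source ; prototype ; type # comment`, with the fields given as space-separated hex code points. A small sketch of turning one line into a dictionary entry:

```python
# Parse one data line of confusables.txt into (source, prototype).
def parse_line(line):
    data = line.split("#", 1)[0]  # drop the trailing comment
    source, proto, _type = (field.strip() for field in data.split(";"))
    return (
        "".join(chr(int(cp, 16)) for cp in source.split()),
        "".join(chr(int(cp, 16)) for cp in proto.split()),
    )

src, proto = parse_line("FF4A ;  006A ;  MA  # ( j -> j ) FULLWIDTH ...")
print(repr(src), repr(proto))  # 'ｊ' 'j'
```

Reading the whole file this way gives a dict mapping each confusable character to its prototype string.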
Richard
  • This seems satisfying enough. I'm not marking the answer accepted in case someone shows up with some existing library doing that, but it seems easy enough to do it given those databases. – tobast Feb 08 '18 at 10:06

With the information @Richard supplied in his answer, I came up with this short Python 3 script that implements the skeleton algorithm from UTS #39:

"""Implement the simple algorithm laid out in UTS#39, paragraph 4
"""

import csv
import re
import unicodedata

comment_pattern = re.compile(r'\s*#.*$')


def skip_comments(lines):
    """
    A filter which skip/strip the comments and yield the
    rest of the lines

    :param lines: any object which we can iterate through such as a file
        object, list, tuple, or generator
    """

    for line in lines:
        line = comment_pattern.sub('', line).strip()
        if line:
            yield line


def normalize(s):
    return unicodedata.normalize("NFD", s)


def to_unicode(code_point):
    return chr(int("0x" + code_point.lower(), 16))


def read_table(file_name):
    d = {}
    with open(file_name, encoding="utf-8-sig") as f:  # utf-8-sig strips the BOM
        reader = csv.reader(skip_comments(f), delimiter=";")
        for row in reader:
            source = to_unicode(row[0])
            prototypes = map(to_unicode, row[1].strip().split())
            d[source] = ''.join(prototypes)
    return d
TABLE = read_table("confusables.txt")


def skeleton(s):
    s = normalize(s)
    s = ''.join(TABLE.get(c, c) for c in s)
    return normalize(s)


def confusable(s1, s2):
    return skeleton(s1) == skeleton(s2)


if __name__ == "__main__":
    for strings in [("Foo", "Bar"), ("Αяe", "Are"), ("j", "j")]:
        print(*strings)
        print("Equal:", strings[0] == strings[1])
        print("Confusable:", confusable(*strings), "\n")

It assumes that the file confusables.txt is in the directory the script is being run from. In addition, the file starts with a UTF-8 byte order mark, a weird non-printable symbol that either has to be deleted by hand or skipped by opening the file with encoding="utf-8-sig".

It only follows the simple algorithm laid out at the beginning of section 4, not the more complicated cases of whole- and mixed-script confusables laid out in sections 4.1 and 4.2. That is left as an exercise for the reader.
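As a rough first step toward those mixed-script cases, you can at least detect which scripts a string mixes. The stdlib exposes no Unicode Script property, so this sketch approximates it with the first word of each character's Unicode name; that assumption happens to work for LATIN/GREEK/CYRILLIC but is not reliable in general:

```python
import unicodedata

def scripts(s):
    """Rough per-character script guess from the Unicode character name
    (e.g. "GREEK CAPITAL LETTER ALPHA" -> "GREEK")."""
    found = set()
    for ch in s:
        if ch.isalpha():
            found.add(unicodedata.name(ch).split()[0])
    return found

print(scripts("Αяe"))  # e.g. {'GREEK', 'CYRILLIC', 'LATIN'}
print(scripts("Are"))  # {'LATIN'}
```

A single "word" mixing several scripts is itself a strong spoofing signal, even before comparing skeletons.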

Note that "я" and "R" are not considered confusable by the Unicode Consortium, so this will return False for those two strings.
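If you do want я and R to count as confusable, you can merge your own entries into the table before building skeletons. A small self-contained sketch (the extra entries and the stand-in table are my own examples, not Unicode data):

```python
import unicodedata

# Hypothetical extra look-alike entries, layered on top of the Unicode
# data, since я -> R does not appear in confusables.txt.
EXTRA_CONFUSABLES = {"я": "R", "Я": "R"}

def skeleton(s, table):
    # Same skeleton transform as in the script above, but taking the
    # confusables table as an explicit argument.
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)

# Stand-in for the real TABLE read from confusables.txt:
table = {"Α": "A", **EXTRA_CONFUSABLES}

# The skeleton is case-sensitive, so я maps to the capital R here:
print(skeleton("Αяe", table) == skeleton("ARe", table))  # True
```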

Graipher