Convert UTF-8 to ASCII

Question

The correct answer is that you can't. However, I'm looking for an answer that is useful rather than correct.

Spammers convert (even properly spelled) spammy ASCII keywords into different non-ASCII UTF-8 characters that typical (Western) humans easily (and incorrectly) mistake for the original 7-bit ASCII spammy keyword.

What I want is a conversion tool that will perform the inverse of what the spammers are doing: incorrectly convert the UTF-8 string back into a similar-looking 7-bit ASCII sequence that reads as the spammy American English word the spammer wants me to misread (even though, pedantically, the UTF-8 is not from the ASCII subset).

I'm looking for something I can use on the Subject lines of email. Then I can kill the rest of the web page or email before spending 5 minutes downloading it over my high-speed 110 baud acoustic link.

Platform is any language commonly available on a generic Linux system such as a Raspberry Pi running Raspbian or Ubuntu.

This is straightforward in the way you mean; there just isn't a standardized way to do it that I'm familiar with. Two hours of tedious table-building ("look at Unicode character, map to ASCII if it seems appropriate") and you should be done. There are a lot of Unicode characters, but they're not infinite. — Rob Napier, Jun 17 '19 at 19:40
Perhaps you could do something different, such as replacing non-ASCII characters with a regex dot (Høst Fæst => /H.st F.est/) and comparing against your list of words using regex. It won't be bulletproof, but perhaps it is easier to implement than trying to map all similar UTF characters. — Juan, Jun 17 '19 at 20:03
@Juan Which introduces another problem; your regex wouldn't match æsthetic to aesthetic, but that is absolutely how it would be read; you'd still need a map to decide how many wildcards you want (two for æ, probably, two for ᇉ, ㆀ, etc.). — Williham Totland, Jun 17 '19 at 20:11
@WillihamTotland You are right. I know it is not a bulletproof solution, and you will probably get false positives too. The example I stole from you :) wasn't the best one, but I think in most cases the replacement would be for one character. — Juan, Jun 17 '19 at 20:24
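
The wildcard idea from the comments above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `word_pattern` and `matching_keywords` are hypothetical names, and letting each non-ASCII character stand for one or two characters is a guess meant to cover cases like æ being read as "ae".

```python
import re

def word_pattern(word):
    # ASCII characters must match literally; every non-ASCII character
    # becomes a wildcard for one or two characters (assumption: this is
    # enough for ligature-like letters such as "æ" read as "ae").
    body = "".join(re.escape(c) if ord(c) < 128 else ".{1,2}" for c in word)
    return re.compile(body, re.IGNORECASE)

def matching_keywords(subject, keywords):
    # Return every keyword that some whitespace-separated word of the
    # subject could be read as.
    hits = set()
    for word in subject.split():
        pattern = word_pattern(word)
        hits.update(k for k in keywords if pattern.fullmatch(k))
    return hits

print(matching_keywords("Høst Fæst", ["host", "fast", "feast", "pills"]))
print(matching_keywords("æsthetic", ["aesthetic"]))
```

Trailing punctuation is not handled here (a word like "Fæst!" would not match "fast"), and the wildcards will happily match unrelated words, so false positives are to be expected.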

Williham Totland · Answer 1 · 2019-06-17T20:04:23.337

The answer is still, annoyingly, that you can't.

The fundamental idea is sound, but humans love making life complicated, so some letters have a significant variation in shape between languages.

This means that for a given character sequence, it's not necessarily clear what American English word the sequence is supposed to resemble.

Further to that, even if you could reduce the character sequences reliably, English is closely related to a lot of European languages that all use their own idiosyncratic alphabetical variations.

For example, reducing "Høst Fæst!" to "Host Fast!" (as well one might) would cause you to incorrectly label the slightly pidgin Norwegian email from your cousin in Minnesota inviting you to Thanksgiving as hosting provider spam.
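
To make that false positive concrete, here is a minimal Python sketch of such a reduction. The `LOOKALIKES` table and the `fold_to_ascii` name are illustrative assumptions, not a standard mapping; the only library call is `unicodedata.normalize`, which splits accented letters into a base letter plus combining marks but leaves letters like ø and æ untouched, hence the hand-made table.

```python
import unicodedata

# Hypothetical hand-made lookalike table. The entries are judgment calls:
# "æ" could just as well map to "ae".
LOOKALIKES = {
    "ø": "o", "Ø": "O",
    "æ": "a", "Æ": "A",
}

def fold_to_ascii(text):
    out = []
    for ch in text:
        ch = LOOKALIKES.get(ch, ch)
        # NFKD turns e.g. "á" into "a" followed by a combining accent.
        for part in unicodedata.normalize("NFKD", ch):
            if ord(part) < 128:
                out.append(part)
            # Anything still non-ASCII (combining marks, unmapped
            # characters) is silently dropped.
    return "".join(out)

print(fold_to_ascii("Høst Fæst!"))  # Host Fast!
print(fold_to_ascii("Vïagrá"))      # Viagra
```

The first line of output is exactly the problem: the folded Norwegian subject is now indistinguishable from the spam it is being compared against.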

Of course, invoking either of these things is crossing the river for water:

Simply consider the (all-ASCII) subject line "PilIs! PiIls! PiIIs!".
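
One way to cope with lookalikes inside ASCII itself is to fold both the subject and the keywords to a common "skeleton" before comparing. The sketch below is an assumption-laden illustration: the `_CONFUSABLE` table, `skeleton` and `contains_keyword` are hypothetical names, and the table deliberately treats i, l, 1 and | as the same letter, which makes collisions between innocent words more likely.

```python
# Hypothetical confusable table, applied after lowercasing.
_CONFUSABLE = str.maketrans({"i": "l", "1": "l", "|": "l", "0": "o"})

def skeleton(text):
    # Lowercase first so that "I" and "i" end up in the same class as "l".
    return text.lower().translate(_CONFUSABLE)

def contains_keyword(subject, keywords):
    # Compare skeletons on both sides so the folding stays consistent.
    folded = skeleton(subject)
    return any(skeleton(k) in folded for k in keywords)

subject = "PilIs! PiIls! PiIIs!"
print(contains_keyword(subject, ["pills"]))             # True
print(contains_keyword("Lunch on Friday?", ["pills"]))  # False
```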
