Searching for foreign variations of characters not present in the ASCII table

Question

I've run into a bit of a problem. I'm making a very basic script just to see how easy the concept is, and I'm really not sure where I should start with it.

My script does the following:

I have an array of words, which will be taken from a DB, but for the sake of this demonstration I've just made it an array with 2 words, "hello" and "goodbye". Normally these will be words that are considered offensive. My script will replace all occurrences of the words in the array with *s, so as to censor them.

One thing I know quite well, as I use a few games etc. that have a similar system, is that this is easily bypass-able by using characters such as é instead of e. Hello = ***** but Héllo = Hello.

What I'd like to know is this: as I've not done anything regarding UTF-8 encoding, nor do I really know how it works with PHP, is there a way to get all variations of a character? So an e/E with all of the possible accents that exist within UTF-8. If it were ASCII I'd simply make an array with all of the ASCII numbers and work that into the code, but I've not been able to find a way to do something similar with UTF-8 characters.

My code works fine, so there's no need for me to post it unless somebody asks me to, but what I'd like to achieve is something similar to this, but with UTF-8.

$a = array(65,97);
foreach($a as $x){
    echo chr($x) . '<br />';
}

This will, obviously, just show A and a. This, I could work in to my code and replace the words even if they contained these characters as well. Something similar would be awesome if possible.

Cheers guys/gals.

An addition: I would like to achieve this without actually typing the foreign characters in my code. I don't want é etc. in my PHP, I'd like to convert from something, in the same way as my code does above, but obviously not with ASCII; something else.

possible duplicate of [Replacing accented characters php](http://stackoverflow.com/questions/3371697/replacing-accented-characters-php) — Will B., May 10 '15 at 07:26
I'm asking if there's a way to do it without actually typing the characters in my code. Converting from letters/symbols/numbers that I can actually type with my own keyboard. The question you linked does not have an answer that will help me with this. — Chris Evans, May 10 '15 at 07:30
your question is tl;tr + plus it is still unclear what you are asking. Drop all that words and show us a short code example — hek2mgl, May 10 '15 at 07:36
Not to worry anyway, I've managed to work it out. I felt that I couldn't explain it without "all that words", and if you'd actually read it then you'd know that I didn't have any code to show, because the question wasn't regarding fixing code, it was regarding a concept that I didn't know about. I very clearly wrote that in the question, like so "My code works fine, so there's no need for me to post it unless somebody asks me to" - But thanks. — Chris Evans, May 10 '15 at 07:52

score 0 · Answer 1 · answered May 10 '15 at 07:50

I've actually stumbled upon something that helped me to decipher my own answer. I was looking at a Unicode conversion tool and found a "Decimal Numeric Character Reference" version of each of the characters that I wanted to use. Then I wrote this code to try it out, and voila, it worked.

//Unicode code points, written as hexadecimal literals
$codePoints = array(
    0x100, 0x101, 0x102, 0x103 //These are 4 of the a's, and I will add the rest
);
//For each of the code points
foreach($codePoints as $x){
    //Display the hexadecimal NCR for this code point, e.g. &#x100;
    echo sprintf('&#x%X;', $x);
}
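
A numeric character reference only turns into the real character once a browser renders it, so it cannot be compared against user input inside PHP. As a rough sketch of the same idea that yields actual UTF-8 strings, the code below uses mb_chr(), assuming PHP 7.2+ with the mbstring extension. The helper name codePointsToString() and the chosen code points are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: build a UTF-8 string from a list of Unicode code points.
// Assumes PHP 7.2+ with the mbstring extension (for mb_chr).
function codePointsToString(array $codePoints): string
{
    $out = '';
    foreach ($codePoints as $cp) {
        // mb_chr() is the multibyte counterpart of chr()
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

// U+00E8 to U+00EB are four accented forms of "e"; none is typed literally here.
$variants = codePointsToString([0xE8, 0xE9, 0xEA, 0xEB]);

// Print the raw bytes as hex to show each character takes two bytes in UTF-8.
echo bin2hex($variants), PHP_EOL; // c3a8c3a9c3aac3ab
```

This keeps the source file pure ASCII, which is what the question asked for, while still producing strings that can be searched for or replaced.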

I think my question was easy enough to understand but apparently I was wrong. Hopefully this'll help somebody else in the future. Thanks.

score 0 · Answer 2 · edited May 23 '17 at 12:21

From the question I marked as a duplicate (Replacing accented characters php) based on your example of:

Hello = ***** but Héllo = Hello,

Objective: find any variant of the word Hello, such as Héllo or Hëllo, and convert it to plain Hello, while preserving special characters such as €, ⓚ, and ⓞ.

http://php.net/manual/en/book.intl.php

Code: (note $test are the strings to normalize)

$test = ['abcd', 'èe', '€', 'àòùìéëü', 'àòùìéëü', 'tiësto', 'Héllo', 'ĀāĂă'];
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', Transliterator::FORWARD);
foreach ($test as $e) {
    $normalized = $transliterator->transliterate($e);
    echo $e . ' --> ' . $normalized . "<br/>";
}

Output:

abcd --> abcd
èe --> ee
€ --> €
àòùìéëü --> aouieeu
àòùìéëü --> aouieeu
tiësto --> tiesto
Héllo --> Hello
ĀāĂă --> AaAa
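
To tie this back to the censor from the question, here is a minimal sketch that strips accents first and then stars out the listed words. It assumes the intl and mbstring extensions are available. The function name censor() and the sample word list are hypothetical, and the returned text has its accents removed, which may or may not be acceptable for display.

```php
<?php
// Hypothetical helper: strip accents, then replace each listed word with asterisks.
// Assumes the intl extension (Transliterator) and mbstring (mb_strlen).
function censor(string $text, array $badWords): string
{
    $stripper = Transliterator::createFromRules(
        ':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;',
        Transliterator::FORWARD
    );
    // "Héllo" becomes "Hello", so the plain word list can match it
    $plain = $stripper->transliterate($text);

    foreach ($badWords as $word) {
        // Case-insensitive, UTF-8 aware match of the literal word
        $pattern = '/' . preg_quote($word, '/') . '/iu';
        $plain = preg_replace_callback($pattern, function ($m) {
            // One asterisk per character of the matched word
            return str_repeat('*', mb_strlen($m[0], 'UTF-8'));
        }, $plain);
    }
    return $plain;
}

// "\u{E9}" is e with an acute accent, written as an escape to keep the source ASCII.
echo censor("H\u{E9}llo and goodbye", ['hello', 'goodbye']), PHP_EOL; // ***** and *******
```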

In case you are unable to use Transliterator you could opt to use the iconv example from the same question.

iconv: http://ideone.com/jOw5Cu

preg_replace('/[^A-Za-z0-9]/', '', iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $e));

Output:

abcd --> abcd
èe --> ee
€ --> EUR
àòùìéëü --> aouieeu
àòùìéëü --> aouieeu
tiësto --> tiesto
Héllo --> Hello
ⓚ --> k,
ĀāĂă --> AaAa

Otherwise you will need to build your own dictionary of characters to convert from, together with their replacement values, e.g. to match H3II0.
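
A minimal sketch of such a dictionary, using strtr() with an array of pairs. The entries in $lookalikes and the helper name foldLookalikes() are illustrative assumptions. A real list would need tuning, since mapping a capital I to l will also alter legitimate text.

```php
<?php
// Hypothetical lookalike map: character to replace => plain letter.
// Keys are quoted strings; PHP stores digit-only keys as integers, which strtr() accepts.
$lookalikes = [
    '3' => 'e',
    '0' => 'o',
    '1' => 'l',
    'I' => 'l',
    '@' => 'a',
    '$' => 's',
];

// Hypothetical helper: fold lookalike characters, then lowercase for comparison.
function foldLookalikes(string $text, array $map): string
{
    // strtr() with an array replaces every key in one pass, without re-replacing output
    return strtolower(strtr($text, $map));
}

echo foldLookalikes('H3II0', $lookalikes), PHP_EOL; // hello
```

The folded string is only meant for matching against the word list. The original input should be kept for display.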

MySQL:

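For a more programmatic approach, you could use your database to look up the filtered words, assuming your censor database is MySQL. Under an accent-insensitive collation (such as utf8_general_ci), accented and unaccented letters compare as equal.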

SELECT 'Hello' = 'Héllo'; -- 1
SELECT 'AaAa' = 'ĀāĂă'; -- 1
SELECT SOUNDEX('Hello') = SOUNDEX('Héllo'); -- 1

http://sqlfiddle.com/#!9/74d39/1

Or even replace the extended characters with % and run a LIKE query against the censor table:

$word = preg_replace('/[^A-Za-z0-9]/u', '%', 'Héllo');
$stmt = $mysqli->prepare("SELECT word FROM censor WHERE word LIKE ?");
$stmt->bind_param("s", $word);
$stmt->execute();
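
One caveat is that % matches any run of characters, so a pattern built this way can match far more words than intended. As a narrower variation (my assumption, not something the linked question proposes), each non-alphanumeric character can be swapped for _, which LIKE treats as exactly one character. The helper name toLikePattern() is hypothetical.

```php
<?php
// Hypothetical helper: build a LIKE pattern where every character outside
// A-Z, a-z and 0-9 becomes "_", the single-character LIKE wildcard.
function toLikePattern(string $word): string
{
    // The /u modifier makes the regex treat the input as UTF-8, so a two-byte
    // character such as e-acute yields one wildcard instead of two.
    $pattern = preg_replace('/[^A-Za-z0-9]/u', '_', $word);

    // preg_replace() returns null on invalid UTF-8; fall back to an empty pattern
    return $pattern ?? '';
}

// "\u{E9}" is e with an acute accent, written as an escape to keep the source ASCII.
echo toLikePattern("H\u{E9}llo"), PHP_EOL; // H_llo
```

The result can be bound to the prepared statement in the same way as $word. Any % or _ already present in the input is also turned into _, so it needs no separate escaping.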

The issue with your title is that you ask about characters not present in the ASCII table, while the extended ASCII table (128-255) includes many of the characters you reference for conversion, such as é: http://www.asciitable.com/

The majority of your question doesn't describe your goal clearly, which is to normalize the special (UTF-8) characters used to bypass censorship of words written in normal (ASCII) characters, so that the word Hello also matches variants such as Héllo or Hëllò.
