I would like the contact form on my website to only accept text submitted in English. I've been dealing with a lot of spam recently that has appeared in multiple languages that is slipping right past the CAPTCHA. There is simply no reason for anyone to submit this form in a language other than English since it's not a business and more of a hobby for personal use.
I've been looking through this documentation and was hopeful that something like preg_match( '/[\p{Latin}]/u', $input)
might work, but I'm not bilingual and don't understand all the nuances of character encoding, so while this will help filter out something like Russian it still allows languages like Vietnamese to slip through.
Ideally I would like it to accept:
- Any Unicode symbol that might be used. I have frequently come across different styles of dashes, apostrophes, or things related to math, for example.
- Common diacritical marks / accented characters found in words like "résumé."
And I would like it to reject:
- Anything that appears to be something other than English, or uncommon. I'm not overly concerned with accents such as "naïve" or in words borrowed from other languages.
I'm thinking of simply stripping all potentially valid characters as follows:
$input = 'testing for English only!';
// reference: https://en.wikipedia.org/wiki/List_of_Unicode_characters
// allowed punctuation
$basic_latin = '`~!@#$%^&*()-_=+[{]}\\|;:\'",<.>/?';
$input = str_replace(str_split($basic_latin), '', $input);
// allowed symbols and accents
$latin1_supplement = '¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿É×é÷';
$input = str_replace(str_split($latin1_supplement), '', $input);
$unicode_symbols = '–—―‗‘’‚‛“”„†‡•…‰′″‹›‼‾⁄⁊';
$input = str_replace(str_split($unicode_symbols), '', $input);
// remove all spaces including tabs and end lines
$input = preg_replace('/\s+/', '', $input);
// check that remaining characters are alpha-numeric
if (strlen($input) > 0 && ctype_alnum($input)) {
echo 'this is English';
} else {
echo 'no bueno señor';
}
However, I'm afraid there might be some perfectly common and valid exceptions that I'm unwittingly leaving out. I'm hoping that someone might be able to offer a more elegant solution or approach?