
I want to reject certain UTF-8 input on the server side, for example East Asian scripts, where the input might be " 伊 ".

However, I do want to keep supporting other Latin or "Latin-like" characters, such as the Welsh ŵ and ŷ, so checking against Latin-1 is not an option.

What are my options? (if language specific, PHP preferred)

Thanks very much.


Reasoning: browser support for many non-Western characters is often missing (for example, on a different browser I just see a box in the question above). For fields such as display names it is sometimes appropriate to restrict input, even when that would not be appropriate for message bodies.

HoboBen
  • Do you mind if I ask why you don't want to allow some languages on an internationalized site? – Borealid Aug 05 '10 at 03:45
  • Fair question. It's just necessary for one field of a table; the rest of the website will support it. – HoboBen Aug 05 '10 at 03:56
  • So what is the subset of characters you're allowing? Does it fit an existing character set? If so, you can just `iconv` the string to the target encoding, discarding all invalid characters. – deceze Aug 05 '10 at 04:00
  • browser support for a lot of non-western characters is often missing (e.g. on a different browser I just see a box in the question above), so for things like display names sometimes it's appropriate to restrict it even if it's not appropriate for message bodies – HoboBen Jun 09 '14 at 02:50

1 Answer


Just do

preg_match('/[^\\p{Common}\\p{Latin}]/u', $string)

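where $string is a UTF-8 string. This returns int(1) if the string contains at least one character outside the Latin and Common scripts, and int(0) otherwise. (Common covers characters shared between scripts, such as digits, punctuation and spaces.) Note that preg_match() returns false on error, which with the /u modifier includes input that is not valid UTF-8.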

Example:

var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷaás??'));  //int(0)
var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷݤaás??')); //int(1)
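
For server-side validation, the check above could be wrapped in a small helper. The following is only a sketch: the function name isLatinOnly() and the choice to treat invalid UTF-8 as a rejection are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: the name and the return convention are illustrative.
// Returns true when $input contains only characters from the Latin or
// Common scripts; returns false when any other character is present or
// when $input is not valid UTF-8.
function isLatinOnly(string $input): bool
{
    $result = preg_match('/[^\p{Common}\p{Latin}]/u', $input);

    // preg_match() returns false on failure; with the /u modifier this
    // includes subjects that are not valid UTF-8. Assumption: reject them.
    if ($result === false) {
        return false;
    }

    // 0 means no character outside Common/Latin was found.
    return $result === 0;
}

var_dump(isLatinOnly('Siân ŵ ŷ'));   // bool(true)
var_dump(isLatinOnly(' 伊 '));        // bool(false)
var_dump(isLatinOnly("\xC3\x28"));   // bool(false), invalid UTF-8
```

One caveat worth checking against your own data: combining diacritical marks (as found in decomposed input, e.g. "w" followed by U+0302) belong to the Inherited script rather than Latin or Common, so such input would be rejected unless it is normalised to NFC first or \p{Inherited} is added to the character class.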
Artefacto