
I'm looking for a function that tests whether a given string contains only (thanks @meagar) characters of a given language. The sample string is UTF-8; the additional argument can be anything (though I imagine it working with locale strings). It shouldn't return true for any non-alphabetic characters.

The output of such a function should be:

test("jérôme", "FR_fr") = true
test("jérôme", "PL_pl") = false
test("jrme", "FR_fr") = true
test("jrme", "PL_pl") = true
test("***hi***", "PL_pl") = false

I'm looking for a generic function: it should work for any valid locale, be it FR_fr, PL_pl, GD_ie or ZH_cn.

Any ideas?

edit: valid point by @deceze - let's change this from [language] to [alphabet]

eithed
  • All of your examples should return `true` http://en.wikipedia.org/wiki/Polish_alphabet – user229044 Jun 04 '13 at 17:12
  • see this page http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php – mohammad mohsenipur Jun 04 '13 at 17:23
  • "ï" and "é" are not officially part of the "English alphabet", yet "naïve", "resumé" and similar words and spellings are commonly used in English. A-Z are not part of any "Japanese alphabet", yet Japanese text commonly contains words in such "romaji"... – deceze Jun 04 '13 at 17:29
  • @meagar - you're incorrect (search the page you've linked for é or ô). – eithed Jun 04 '13 at 17:47
  • @mohammad mohsenipur - definitely a good resource (and valid resource I'll check it out), yet not doing what I want (I don't want to check what language "test" was written in, I want to test if "test" was written in english, for example). The links there will be helpful though ;) – eithed Jun 04 '13 at 17:48
  • @deceze - that's correct - I'd say that loanword characters are not part of the alphabet, but if EN_en locale would define these characters as valid, they would be valid. – eithed Jun 04 '13 at 17:52
  • @eithed Did you mean, you want to know whether a string contains **only** characters from a given language? Because every single one of your strings contains *some* polish characters. – user229044 Jun 04 '13 at 18:34
  • @meagar - ah, I see now where you're coming from. Yes, that's correct, sorry! – eithed Jun 04 '13 at 18:44

1 Answer


You can use the Unicode "Script" property (assuming your regex engine supports it) to restrict matches to a specific script. You cannot get much more specific than that though.
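
As a rough illustration of the script-based approach, here is a minimal TypeScript sketch. It assumes a regex engine with Unicode property escapes (ES2018 or later); the helper name `isOnlyScript` is hypothetical and not part of any library. Note that it takes a Unicode script name such as "Latin" or "Cyrillic" rather than a locale, so it cannot tell French from Polish: both use the Latin script.

```typescript
// Hypothetical helper: returns true when every character of `text` has the
// given Unicode Script property (e.g. "Latin", "Cyrillic", "Han").
function isOnlyScript(text: string, script: string): boolean {
  // The "u" flag enables \p{...} property escapes.
  // An unknown script name makes the RegExp constructor throw a SyntaxError.
  const pattern = new RegExp("^\\p{Script=" + script + "}+$", "u");
  return pattern.test(text);
}

// Precomposed accented letters (U+00E9, U+00F4) count as Latin.
console.log(isOnlyScript("j\u00e9r\u00f4me", "Latin"));
// Asterisks belong to the "Common" script, so this string is rejected.
console.log(isOnlyScript("***hi***", "Latin"));
// Cyrillic text is not Latin.
console.log(isOnlyScript("\u043f\u0440\u0438\u0432\u0435\u0442", "Latin"));
```

One caveat with this sketch: accents written as separate combining marks have the script "Inherited" and would be rejected, so normalizing the input to NFC first may be worth considering.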

Ignacio Vazquez-Abrams