How can I write a regular expression that matches all valid Spanish and Arabic words?
In English I know it is `a-zA-Z`, in Hebrew it is `א-ת`, and in Russian `А-Яа-яёЁ`.
I use JavaScript.

- possible duplicate of [Javascript + Unicode](http://stackoverflow.com/questions/280712/javascript-unicode) – Qtax Jun 04 '12 at 13:03
- why would you translate Arabic to Spanish, you can use the words they are going to understand you ;) half kidding ;) – Sebas Jun 04 '12 at 13:16
- @Sebas I'm not translating from Spanish to Arabic. I want to validate an INPUT field whose text must be either in Spanish or in Arabic – Alex Shvarz Jun 04 '12 at 13:58
- @Qtax maybe it can help me, thank you – Alex Shvarz Jun 04 '12 at 13:59
1 Answer
The range `a-zA-Z` for English words is unacceptably simple and naïve. It leaves out all manner of letters with accents and other special marks that are used in loan words, etc. For instance, it won't match the word "naïve", from my first sentence. Use the `\p{Latin}` script instead.
The range `א-ת` for Hebrew words is also wrong. It leaves out Hebrew presentation forms, cantillation marks, Yiddish digraphs, and more. Use the `\p{Hebrew}` script instead.
The range `А-Яа-яёЁ` for Russian is again incomplete and wrong. Use the `\p{Cyrillic}` script instead.
The Spanish alphabet uses the same 26 letters as English, plus `ñÑ`. But again, don't hardcode these into a range. Many Spanish words use accented vowels. Use the `\p{Latin}` script to match Spanish words. Regexes won't help you distinguish Spanish from English.
For Arabic, use the `\p{Arabic}` script.
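
To make the difference concrete, here is a minimal sketch comparing a hardcoded ASCII range with a script property. It assumes an engine that implements ECMAScript 2018 Unicode property escapes, where the native spelling is `\p{Script=Latin}` and the `u` flag is required. The sample words and the variable names are only illustrative.

```javascript
// Hardcoded ASCII range versus the Latin script property.
// Assumes ES2018 Unicode property escapes (note the required `u` flag).
const asciiWord = /^[a-zA-Z]+$/;
const latinWord = /^\p{Script=Latin}+$/u;

// Escapes are used so the accented letters are single precomposed code points.
const samples = [
  "naive",
  "na\u00EFve",  // naïve
  "ma\u00F1ana", // mañana
  "\u043F\u0440\u0438\u0432\u0435\u0442", // привет (Cyrillic, not Latin)
];

const results = samples.map((word) => ({
  word,
  ascii: asciiWord.test(word),
  latin: latinWord.test(word),
}));

console.log(results);
```

Only the first sample passes the ASCII range, while the first three pass the Latin script test. If the input may contain decomposed characters (a base letter followed by a combining mark), calling `normalize("NFC")` on it first is one way to keep this comparison meaningful.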
JavaScript, regex, and Unicode
You said you're using JavaScript. Unfortunately, JavaScript has very little support for Unicode built-in. In JavaScript, you need to use the XRegExp library and its Unicode addon. That will allow you to use all of the Unicode scripts I mentioned above in your regular expressions.
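
Engines that implement ECMAScript 2018 or later also expose script properties natively through `\p{Script=...}` with the `u` flag, so a library is not strictly required there. The sketch below applies that to the input-validation case from the question. The helper name `isLatinOrArabicWord` is hypothetical, and it deliberately accepts only a single word written in one script.

```javascript
// Hypothetical helper: true when `text` is one word written entirely in
// Latin script (which covers Spanish) or entirely in Arabic script.
// Assumes ES2018 Unicode property escapes are available.
function isLatinOrArabicWord(text) {
  // Compose base letter + combining mark pairs into single code points first.
  const normalized = text.normalize("NFC");
  return /^(?:\p{Script=Latin}+|\p{Script=Arabic}+)$/u.test(normalized);
}

console.log(isLatinOrArabicWord("ma\u00F1ana"));                    // Spanish word
console.log(isLatinOrArabicWord("\u0645\u0631\u062D\u0628\u0627")); // Arabic word
console.log(isLatinOrArabicWord("abc123"));                         // digits are rejected
console.log(isLatinOrArabicWord("a\u0645"));                        // mixed scripts are rejected
```

As noted above, this cannot tell Spanish from English or any other Latin-script language. It only checks which script the letters belong to.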
Scripts vs blocks
Always favor Unicode scripts over Unicode blocks. Blocks match up poorly with the code points in a particular script. Blocks very often leave out many important code points that fall outside of their incomplete range, and include many code points that have not been assigned any character. Scripts include all relevant code points, and no more.
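
A small sketch of that mismatch, using Hebrew as the example. It assumes ES2018 Unicode property escapes and writes the Hebrew block as the raw range U+0590 to U+05FF. The variable names are illustrative, and the unassigned status of U+0590 reflects the Unicode versions I am aware of.

```javascript
// The Hebrew block as a raw code point range versus the Hebrew script property.
const hebrewBlock = /^[\u0590-\u05FF]$/;
const hebrewScript = /^\p{Script=Hebrew}$/u;

const alef = "\u05D0";        // HEBREW LETTER ALEF: in the block and in the script
const unassigned = "\u0590";  // inside the block, but no character is assigned here
const shinWithDot = "\uFB2A"; // HEBREW LETTER SHIN WITH SHIN DOT: a presentation
                              // form that lives outside the Hebrew block

const comparison = [alef, unassigned, shinWithDot].map((ch) => ({
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  inBlock: hebrewBlock.test(ch),
  inScript: hebrewScript.test(ch),
}));

console.log(comparison);
// U+05D0: inBlock true,  inScript true
// U+0590: inBlock true,  inScript false
// U+FB2A: inBlock false, inScript true
```

The block range accepts an unassigned code point and misses a real Hebrew letter form, while the script property gets both cases right.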
