11

I can't solve my problem with a regular expression.

When I use:

$string = preg_replace("#\[name=([a-zA-Z0-9 .-]+)*]#","$name_start $1 $name_end",$string);

everything works, except with Russian text.

So I tried rewriting the regexp:

$string = preg_replace("#\[name=([a-zA-Z0-9а-яА-Я .-]+)*]#","$name_start $1 $name_end",$string);

but this does not work.

I know one workaround, which is to list every letter:

$string = preg_replace("#\[name=([a-zA-Z0-9йцукенгшщзхъфывапролджэячсмитьбю .-]+)*]#","$name_start $1 $name_end",$string);

but this is crazy :D

Please give me a simpler variant.

Slava Vedenin
vorobey
  • Please share more details, like sample input strings and the expected output corresponding to these strings – Nico Haase Apr 24 '21 at 18:26

4 Answers

19

Try a Unicode range:

'/[\x{0410}-\x{042F}]/u'  // matches a capital cyrillic letter in the range A to Ya

Don't forget the /u flag for Unicode.

In your case:

"#\[name=([a-zA-Z0-9\x{0430}-\x{044F}\x{0410}-\x{042F} .-]+)*]#u"

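Note that the star after the group in your regex is redundant: everything already gets "eaten" by the plus inside the group. The only thing the star adds is a match on an empty `[name=]`. For non-empty names, this would do the same: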

"#\[name=([a-zA-Z0-9\x{0430}-\x{044F}\x{0410}-\x{042F} .-]+)]#u"
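A minimal sketch of this pattern in use. The values of `$name_start` and `$name_end` are hypothetical, since the question does not show them. The character class here also adds Ё (U+0401) and ё (U+0451), which sit outside the two ranges used above.

```php
<?php
// Hypothetical wrapper strings; the question does not show the real values.
$name_start = '<b>';
$name_end   = '</b>';

// Two Cyrillic ranges (А-Я, а-я) plus Ё (U+0401) and ё (U+0451),
// which lie outside those ranges. The hyphen comes last in the class
// so it is taken literally. The u modifier switches on UTF-8 mode.
$pattern = '#\[name=([a-zA-Z0-9\x{0410}-\x{042F}\x{0430}-\x{044F}\x{0401}\x{0451} .-]+)\]#u';

$string = 'Hello [name=Ёлка Иванов-Smith]!';

// "$1" stays literal inside double quotes (it is not a valid variable name),
// so preg_replace sees it as the first capture group.
$result = preg_replace($pattern, "$name_start $1 $name_end", $string);

echo $result, PHP_EOL; // Hello <b> Ёлка Иванов-Smith </b>!
```

The pattern is written in single quotes so that PHP passes `\x{...}` through to PCRE unchanged.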
Bart Kiers
  • it's bad, cause it's unreadable! Code must be simple and readable :) – bmalets Jul 03 '13 at 10:31
  • I think you meant `\x{0401}-\x{042f}` for russian words. `A` indeed is the first character in the russian alphabet, but not in the unicode range. Check the unicode table [here](http://www.wikiwand.com/en/Russian_alphabet) – Iulian Onofrei Jan 12 '15 at 13:10
  • @Iulian Onofrei, yeah, I indeed see `\x{0401}` in there, but don't see the range `\x{0402}-\x{0409}`. Should it then be `[\x{0401}\x{0410}-\x{042F}]`, or really `[\x{0401}-\x{042F}]` ? Feel free to edit, of course! – Bart Kiers Jan 12 '15 at 15:47
  • @BartKiers, yes, you're right. I missed that. I think the correct solution is: `[\x{0430}-\x{044F}\x{0451}]` for lowercase letters and `[\x{0401}\x{0410}-\x{042F}]` for uppercase letters. – Iulian Onofrei Jan 20 '15 at 14:45
7

The Unicode script properties (supported since PCRE 3.3) provide a test for the property Cyrillic.

For example, to replace all characters that are neither Cyrillic nor (Latin) digits:

$string = '1a2b3cйdцeуfкбxюy';
echo preg_replace('/[^0-9\p{Cyrillic}]/u', '*', $string);

You can find the documentation for that feature at http://www.pcre.org/pcre.txt under "Unicode character properties".
You also have to specify the PCRE_UTF8 modifier (u), as described at http://docs.php.net/reference.pcre.pattern.modifiers
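Applied to the pattern from the question, the property can go straight into the character class. This is only a sketch: the `$name_start` and `$name_end` values are hypothetical placeholders, and the redundant star after the group is left out.

```php
<?php
// Hypothetical wrapper strings; the question does not show the real values.
$name_start = '<i>';
$name_end   = '</i>';

// \p{Cyrillic} inside a character class matches any character of the
// Cyrillic script, so letters such as ё need no separate entry.
// The u modifier is required for \p{...} to work on UTF-8 input.
$pattern = '#\[name=([a-zA-Z0-9\p{Cyrillic} .-]+)\]#u';

$input  = 'Привет, [name=Пётр Smith-Иванов]!';
$output = preg_replace($pattern, "$name_start $1 $name_end", $input);

echo $output, PHP_EOL; // Привет, <i> Пётр Smith-Иванов </i>!
```

Keep in mind that `\p{Cyrillic}` covers the whole script, not just the Russian alphabet, so it also accepts letters used only in other Cyrillic-based languages.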

VolkerK
  • Afaik there are no (technical) differences between the "common" scripts and specifying the ranges "manually". So it's only a matter of choice. Except maybe that the property names are a bit more self-documenting. – VolkerK Oct 15 '09 at 09:38
0

This one worked for me:

/^[а-яА-Я\p{Cyrillic}0-9\s\-]+$/ 

I have tested it in all browsers, including Safari.

nhahtdh
Kissa Mia
0

These are among the most widely used scripts on the internet.

This has worked for a good while now, I believe since PHP 5.6.

// Filter Chinese and Japanese HAN
if (preg_match("/\p{Han}+/u", " 余TEST杭丽人广播", $match)){echo "CHINESE, JAPANESE ";}
// Filter Cyrillic
if (preg_match("/\p{Cyrillic}/u", "Күңел радиосы ", $match)){echo "RUSSIAN ";}
// Filter Greek
if (preg_match("/\p{Greek}/u", "Πρακτορείο ", $match)){echo "GREEK ";}
// Filter Arabic
if (preg_match("/\p{Arabic}/u", "مشال راډیو", $match)){echo "ARABIC ";}
// Filter Armenian
if (preg_match("/\p{Armenian}/u", "Ազատություն ", $match)){echo "ARMENIAN ";}
// Filter Thai
if (preg_match("/\p{Thai}/u", "สวท.พะเยา", $match)){echo "THAI ";}
// Filter Georgian
if (preg_match("/\p{Georgian}/u", "რადიო თავისუფალი", $match)){echo "GEORGIAN";}

/* Output: */
/* CHINESE, JAPANESE RUSSIAN GREEK ARABIC ARMENIAN THAI GEORGIAN */
NVRM
  • Please add some explanation to your answer such that others can learn from it. As far as I see, the OP did not ask for a regular expression to detect a language – Nico Haase Apr 24 '21 at 18:26
  • This is the point: no explanation needed. – NVRM Apr 24 '21 at 18:28