Possible Duplicate:
regexp with russian lang

I have a regular expression that picks certain links out of a text and attaches a file icon based on the file type of each link. Like this:

$text = preg_replace('((<a href="[\w\./:]+getfile.php\?id='.$file.'"([a-zA-Z0-9_\- ,\.:;"=]*)>)([a-zA-Z0-9_,\.:;&\-\(\)\<\>\'/ ]+)</a>)','\\1'.fileicon($name).'</a> \\1\\3</a> ('.($pagecount?$pagecount."&nbsp;".($pagecount>1?$pages:$page1).", ":"").readable_filesize($size,1).')',$text);

This worked great until I tried it with some Russian text. The input would be something like:

<a href="/site/getfile.php?id=33">Русский</a>

But it won't show the icon before the link and file information after the link, making me suspect the regex doesn't play well with Russian text. What could be the case here?

tvgemert

4 Answers

Your character class only allows [a-zA-Z0-9_,\.:;&\-\(\)\<\>\'/ ]. There are no Russian characters in there.

You can fix this by adding the relevant characters to the class. If you only need to support Russian, \p{Cyrillic} should do it (PHP's PCRE does not accept the Perl/Java-style \p{InCyrillic}). If you want all Unicode letters, use \p{L}.

carlpett
    The Unicode character classes (and `\pL` would suffice) only work if the `/u` modifier is added, however. – mario Sep 07 '11 at 10:04
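
To illustrate the point about the character class, here is a minimal sketch. It assumes the script is saved as UTF-8, and the sample link and the shortened patterns are made up for the demonstration rather than taken from the real code:

```php
<?php
// Hypothetical sample input, modeled on the link from the question.
$text = '<a href="/site/getfile.php?id=33">Русский</a>';

// ASCII-only class for the link text: Cyrillic letters are not in it.
$asciiOnly = '~<a href="[\w./:]+getfile\.php\?id=33">([a-zA-Z0-9_,.:; -]+)</a>~';

// Same class with \p{L} (any Unicode letter) added, plus the u modifier.
$withLetters = '~<a href="[\w./:]+getfile\.php\?id=33">([\p{L}0-9_,.:; -]+)</a>~u';

var_dump(preg_match($asciiOnly, $text));       // int(0)
var_dump(preg_match($withLetters, $text, $m)); // int(1)
echo $m[1], "\n";                              // Русский
```

Swapping \p{L} for \p{Cyrillic} would restrict the match to Cyrillic letters only.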

You should use the u modifier when working with Unicode strings:

preg_replace('/>([^<]+)</u', '', $string);
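
A quick way to see what the u modifier changes is to count how many times "." matches. This is only a sketch and assumes the script is saved as UTF-8; the sample word is arbitrary:

```php
<?php
$word = 'Русский'; // 7 characters, 14 bytes in UTF-8

// Without u, the subject is treated as bytes, so "." matches one byte at a time.
$bytes = preg_match_all('/./', $word);

// With u, the subject is treated as UTF-8, so "." matches one whole character.
$chars = preg_match_all('/./u', $word);

echo $bytes, ' ', $chars, "\n"; // 14 7
```
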
sanmai

You can simplify your regexp down to something like

$re = "~
    (<a\s+href=\".+?getfile\.php\?id=$file\".*?>)
    (.+?)
    </a>
~xui";

This should solve the Cyrillic problem automatically, because (.+?) accepts any characters as the link text instead of a fixed list.
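
A possible way to plug this pattern into preg_replace is sketched below. The file id, the input text, and the [icon] and (12 KB) placeholders are hypothetical stand-ins for the values and helper calls from the question:

```php
<?php
$file = 33; // hypothetical file id
$text = 'See <a href="/site/getfile.php?id=33">Русский</a> for details.';

$re = "~
    (<a\s+href=\".+?getfile\.php\?id=$file\".*?>)
    (.+?)
    </a>
~xui";

// $1 is the opening <a ...> tag, $2 is the link text.
// [icon] and (12 KB) stand in for fileicon() and readable_filesize().
$out = preg_replace($re, '$1[icon]</a> $1$2</a> (12 KB)', $text);
echo $out, "\n";
```

If $file can ever contain something other than digits, passing it through preg_quote() before building the pattern would be a sensible precaution.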

user187291

Cyrillic Unicode characters are within the range \x{0400}-\x{04FF}. Add this range to your character class and use the u modifier, which PCRE needs for code points above \x{FF}.
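
As a rough sketch of the range approach (the sample link and the shortened pattern are assumptions, and the script is assumed to be saved as UTF-8):

```php
<?php
// Hypothetical sample input, modeled on the link from the question.
$text = '<a href="/site/getfile.php?id=33">Русский</a>';

// \x{0400}-\x{04FF} covers the Cyrillic block; it only works together with u.
$pattern = '~>([a-zA-Z0-9 \x{0400}-\x{04FF}]+)</a>~u';

preg_match($pattern, $text, $m);
echo $m[1], "\n"; // Русский
```
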

Toto