I'm trying to preg_match a link that is half in English and half in Arabic.

The link as an example looks like:

"/<arabic>/123/<arabic>-<english>.html" 

A basic preg_match('@<a href="/(.*?).html" >@', ...) returns everything, but the Arabic in the URL comes back garbled, so the result no longer identifies a page: it returns "دانلود-رایÚ", for example.

I've tried things I've seen, such as \p{Arabic}, but this returns nothing. Is there a way to capture these links?

I'm pretty stumped and can't figure out a way around this issue.

Edit to add the preg_match_all call and the text I'm attempting to match.

preg_match_all('@<a href="/\p{Arabic}/(.*?)/\p{Arabic}-(.*?)" >@iu',$page,$link);

Example text:

"a href="/دانلود-رایگان-کتاب/کتاب-های-خارجی/مطلب/2120-the-essential-financial.html"
Francis Laclé
  • could you include a code snippet including the regular expression and sample text you're trying to match against? – Jeff Lambert Nov 11 '14 at 17:02
  • this post may help: http://stackoverflow.com/questions/12046526/preg-replace-and-preg-match-arabic-characters – teeyo Nov 11 '14 at 17:02
  • I have just edited in the code & example text. Thanks for the link, teeyo. I did see that but wasn't sure whether you had to know which characters were required, etc. I will look into that now. –  Nov 11 '14 at 17:13

1 Answer

Think twice before using regex to parse HTML.

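// Parse the page with DOM rather than a regex.
$doc = new DOMDocument();
$doc->loadHTML($yourHTML);

// Collect every <a> element in the document.
$links = $doc->getElementsByTagName('a');

// Print the href attribute of each link.
foreach ($links as $link) {
  echo $link->getAttribute('href');
}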
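
As a minimal sketch of how the DOM approach could be applied to the link from the question, the PHP below reads each href and keeps only those whose first path segment is Arabic-script text. The sample markup, the helper name extractArabicLinks(), and the pattern are assumptions made for illustration. The `<?xml encoding="UTF-8">` prefix is a widely used workaround for DOMDocument reading input as ISO-8859-1, which may be what produces garbled output like that shown in the question. The pattern uses `[\p{Arabic}-]+` with the `u` modifier because a bare `\p{Arabic}` matches only a single character and never a hyphen.

```php
<?php
// Hypothetical sample markup built from the example link in the question.
$html = '<p><a href="/دانلود-رایگان-کتاب/کتاب-های-خارجی/مطلب/2120-the-essential-financial.html" >Book</a>'
      . '<a href="/about.html" >About</a></p>';

/**
 * Return every href in $html whose first path segment is Arabic-script text.
 * extractArabicLinks() is an illustrative helper, not a library function.
 */
function extractArabicLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8. The prefix tells libxml to decode it
    // as UTF-8 instead of falling back to ISO-8859-1.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $matches = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Some pages percent-encode non-ASCII paths; decode them so that
        // \p{Arabic} sees real characters.
        $href = rawurldecode($anchor->getAttribute('href'));
        // u switches PCRE to UTF-8 mode; + lets \p{Arabic} span whole words;
        // the hyphen is listed separately because it is not an Arabic letter.
        if (preg_match('@^/[\p{Arabic}-]+/.+\.html$@u', $href)) {
            $matches[] = $href;
        }
    }
    return $matches;
}

print_r(extractArabicLinks($html));
```

The script file itself needs to be saved as UTF-8 for the literal sample link to work. If the page is fetched in another encoding, converting it to UTF-8 first (for example with mb_convert_encoding) would be the place to start.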
dynamic