
I'm writing a script that analyzes domains. Part of it collects each domain's anchor texts and checks whether any of those strings contain Chinese characters.

I'm using this code but it doesn't seem to work:

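foreach ($anchors as $anchor) {

    // echo $anchor;

    // Reset the flag for each anchor so it is never read uninitialized.
    $chinese_flag = 0;

    // \p{Han} matches Han (CJK ideograph) characters; /u treats the subject as UTF-8.
    if (preg_match("/\p{Han}+/u", $anchor))
        $chinese_flag = 1;

    if ($chinese_flag == 1):
        echo "Found Chinese anchor in: " . $anchor;
        break;
    endif;
}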

When trying to echo out each anchor, I can clearly see that some of the anchors use Chinese symbols such as 中文网站100强 (just giving an example). What am I doing wrong here?

P.S. I've also tried several other regular expressions that I found on Stack Overflow, but none of them work in my case.

Infinite Recursion
Zannix
  • `echo $anchor;` prints something? – chanchal118 Dec 27 '13 at 10:21
  • yes, it prints the $anchor normally, both chinese and non chinese anchors – Zannix Dec 27 '13 at 10:22
  • http://php.net/manual/ja/function.preg-match.php#94424 – chanchal118 Dec 27 '13 at 10:31
  • Also worth reading: http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode – Raptor Dec 27 '13 at 10:32
  • Cannot reproduce the problem: http://3v4l.org/BTXKY You need to provide more details. – deceze Dec 27 '13 at 10:34
  • Also, prefix preg_match syntax with UTF-8 may also be useful: http://stackoverflow.com/a/9473867/188331 – Raptor Dec 27 '13 at 10:35
  • Deceze, that's very interesting... Note: I'm getting anchors from an external website by using cURL, using a regular expression to find anchors and storing them in an array. As I said, I've tried echoing them out and they seem to be saved fine... – Zannix Dec 27 '13 at 10:44

1 Answer


This seems to work:

foreach ($anchors as $anchor) {

    $chinese_flag = FALSE;

    if (preg_match("/[\p{Han}]/simu", $anchor))
        $chinese_flag = TRUE;

    if ($chinese_flag):
        echo "Found Chinese anchor in: " . $anchor;
        break;
    endif;
}
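If the loop above still finds nothing even though the anchors look Chinese when echoed, it may help to inspect what preg_match actually receives. The following is only a diagnostic sketch: describe_anchor() is a hypothetical helper name, not part of PHP, and it assumes PHP 7 or later for the type hints. It reports whether the string is valid UTF-8, whether it contains Han characters, and whether it contains numeric HTML entities (which a browser would render as Chinese even though the raw bytes are plain ASCII).

```php
<?php
// Hypothetical diagnostic helper (not a built-in): summarizes what the
// regex engine sees for one anchor string.
function describe_anchor(string $anchor): array
{
    return [
        // An empty pattern with /u matches any valid UTF-8 string;
        // preg_match returns false on malformed UTF-8.
        'valid_utf8' => preg_match('//u', $anchor) === 1,
        // True only if a raw Han character is present in the bytes.
        'has_han' => preg_match('/\p{Han}/u', $anchor) === 1,
        // True if the string holds numeric entities such as &#20013; or &#x4E2D;.
        'has_numeric_entity' => preg_match('/&#(?:x[0-9a-f]+|[0-9]+);/i', $anchor) === 1,
        // Raw bytes in hex, useful for spotting a non-UTF-8 encoding.
        'hex' => bin2hex($anchor),
    ];
}
```

Calling print_r(describe_anchor($anchor)) inside the loop should show which case applies: malformed UTF-8 points to an encoding mismatch in the fetched page, while has_numeric_entity being true points to entities that need decoding before matching.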

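Based on your comments, I've updated the answer. The fetched anchors contain numeric HTML entities (such as &#x4E2D;) rather than raw UTF-8 characters, so decode them with html_entity_decode() before matching: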

<?php
$test = '&#x4E2D;';

$anchor = html_entity_decode($test, ENT_COMPAT, 'UTF-8');


if (preg_match("/[\p{Han}]/simu", $anchor)) {
    echo 'Yay';
}
?>
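Applied to the original loop, the decode-then-match idea could look roughly like the sketch below. The function names contains_han() and first_chinese_anchor() are made up for illustration, and the code assumes PHP 7.1 or later (for the nullable return type) and that the anchors are either UTF-8 or ASCII with HTML entities.

```php
<?php
// Hypothetical helper: true if $text contains a Han character, either as a
// raw UTF-8 character or written as an HTML entity.
function contains_han(string $text): bool
{
    // Turn entities such as &#x4E2D; or &#20013; into real UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/\p{Han}/u', $decoded) === 1;
}

// Hypothetical helper: returns the first anchor containing Han characters,
// or null if none does.
function first_chinese_anchor(array $anchors): ?string
{
    foreach ($anchors as $anchor) {
        if (contains_han($anchor)) {
            return $anchor;
        }
    }
    return null;
}
```

With this in place, the loop body reduces to a single call such as $found = first_chinese_anchor($anchors); followed by a null check before echoing the result.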
Pedro Lobito
  • Doesn't seem to work for me... how have you tested this? Note: I'm getting anchors from an external website by using cURL, using a regular expression to find anchors and storing them in an array. As I said, I've tried echoing them out and they seem to be saved fine... – Zannix Dec 27 '13 at 10:43
  • You can also force the script to use utf-8 by adding this to the top of your script: `ini_set('default_charset', 'UTF-8');` – Pedro Lobito Dec 27 '13 at 11:54
  • I added the ini_set to the top and tried outputting the cURL result to a file and there weren't chinese symbols but this instead: 中文 – Zannix Dec 27 '13 at 13:20
  • are the original source characters like that, or only after the curl? – Pedro Lobito Dec 27 '13 at 13:59
  • Convert the curl output with: `$string = iconv("GB2312", "UTF-8", $curloutput);` – Pedro Lobito Dec 27 '13 at 19:09
  • Hey, you're right, the characters in the original source code are encoded like that: each one seems to start with '&#' and end with ';'. Do you know a regular expression that could check whether a string contains something that starts and ends with those patterns? That's how I would know they're Chinese, I guess. – Zannix Dec 28 '13 at 00:22
  • hmm, I've tried it but still didn't come to a solution, maybe I'm doing something wrong? here's a snippet: http://3v4l.org/jlVS4 – Zannix Dec 28 '13 at 10:20
  • I've updated the answer, it works perfectly now on php 5.1.3 - 5.5.7. http://3v4l.org/1J6H5 – Pedro Lobito Dec 28 '13 at 10:55