4

I am trying to scrape information from a site.

The site shows text like this:

127 East Zhongshan No 2 Rd; 中山东二路127号

But when I scrape it and echo it, the output looks like this:

127 East Zhongshan No 2 Rd; 中山ä¸äºè·¯127å· 

I also tried using UTF-8, without success.

Please help me solve this problem. Here is my PHP code:

function GrabPage($site){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);
    // CURLOPT_RETURNTRANSFER makes curl_exec() return the body, so no output buffering is needed.
    $html = curl_exec($ch);
    // Close the handle before returning; code placed after a return never runs.
    curl_close($ch);
    return $html;
}
$GrabData   = GrabPage($site);

$dom    = new DOMDocument();
@$dom->loadHTML($GrabData);

$xpath  = new DOMXpath($dom);


$mainElements = array();
$mainElements = $xpath->query("//div[@class='col--one-whole mv--col--one-half wv--col--one-whole'][1]/dl/dt");

foreach ($mainElements as $Names2) {
    $Name2  = $Names2->nodeValue;
    echo "$Name2";
}
Kevin
Feroz Ahmed
  • This is the site URL: http://www.lonelyplanet.com/china/shanghai/transport/transportation-travel-services/jinling-road-ferry – Feroz Ahmed Apr 23 '15 at 05:32
  • loadHTML expects Latin-1 encoded data, see [PHP DomDocument failing to handle utf-8 characters (☆)](http://stackoverflow.com/q/11309194/367456) for details. – hakre Apr 23 '15 at 06:04
  • What you've got here is an HTML 5+ document with the `<meta charset="utf-8">` tag. It seems that this tag is not lying about the file encoding, so it is *UTF-8*. The default encoding in HTML 0-4.x is *ISO-8859-1*. **DOMDocument** in PHP expects HTML 4.01. – hakre Apr 23 '15 at 06:07

2 Answers

1

First off, set the charset at the top of the PHP file, before any other output is sent:

header('Content-Type: text/html; charset=utf-8');

Then convert the HTML markup you fetched with mb_convert_encoding before passing it to loadHTML:

@$dom->loadHTML(mb_convert_encoding($GrabData, 'HTML-ENTITIES', 'UTF-8'));

Sample Output
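
As a self-contained illustration of the conversion step, here is a minimal sketch that parses a hard-coded UTF-8 snippet instead of the fetched page. The sample markup and the helper name `loadUtf8Html` are made up for the example. Because the `HTML-ENTITIES` target of `mb_convert_encoding` is deprecated as of PHP 8.2, the sketch swaps in `mb_encode_numericentity`, which should have the same effect for this purpose; treat it as one possible variant rather than the only way.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML string without mangling non-ASCII text.
function loadUtf8Html(string $html): DOMDocument
{
    // Turn every code point above ASCII into a numeric entity (e.g. &#20013;),
    // so the parser's guess about the input encoding no longer matters.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii   = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    // Collect parser warnings instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Made-up markup standing in for the scraped page.
$html    = '<html><body><dl><dt>127 East Zhongshan No 2 Rd; 中山东二路127号</dt></dl></body></html>';
$xpath   = new DOMXPath(loadUtf8Html($html));
$address = $xpath->query('//dl/dt')->item(0)->nodeValue;

echo $address, PHP_EOL;
```

With this input the script should print the address with the Chinese characters intact, since DOM node values always come back as UTF-8.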

Kevin
0

First, check whether the captured HTML source is properly encoded. If it is, try

utf8_decode($Name2)

This should get your string ready for saving as well as printing.
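
The likely reason this can work is that the parser read the UTF-8 bytes as Latin-1, so each byte ended up encoded a second time. The sketch below only simulates that misreading with `mb_convert_encoding` (it does not call DOMDocument), and the variable names are made up for the example. It uses `mb_convert_encoding` for the reverse step as well, because `utf8_decode` is deprecated as of PHP 8.2.

```php
<?php
$original = '中山东二路127号';   // correct UTF-8 text

// Simulate the bug: interpret the UTF-8 bytes as ISO-8859-1 and re-encode them as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reverse it: map each character back to a single byte.
// utf8_decode($garbled) performs the same UTF-8 to ISO-8859-1 conversion.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($garbled === $original);   // bool(false)
var_dump($repaired === $original);  // bool(true)
```

Note that this only helps when the text really was double-encoded. Applied to correctly decoded UTF-8, the same conversion replaces every character outside Latin-1 with `?`.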

Clain Dsilva