3

I have some code that reads an arbitrary webpage and outputs the page title to the user. Depending on what they enter in the URL field, the site might be in English, Chinese, Russian, or any other language. The problem is that it keeps displaying garbled text: ¹ù¸»³Ç - Google ËÑË÷

Any ideas would be greatly appreciated.

<!doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<?php

$DOM = new DOMDocument('1.0', 'UTF-8');

if( !@$DOM->loadHTMLFile( 'http://www.google.com.sg/search?hl=zh-CN&biw=1366&bih=636&q=%E9%83%AD%E5%AF%8C%E5%9F%8E&oq=%E9%83%AD%E5%AF%8C%E5%9F%8Ea&aq=f&aqi=g10&aql=undefined&gs_sm=e&gs_upl=6545l6545l0l1l1l0l0l0l0l295l295l2-1l1aa' ) ) {
    die('cannot load!');
}
else {
    $XPath = new DOMXPath( $DOM );
    $title = strip_tags( $XPath->query('//title')->item(0)->nodeValue );
    echo $title; exit;
}

?>
mauris
pakito

4 Answers

3

If you add &oe=utf-8 to the query string and use utf8_decode() when outputting the data, that should solve your problem:

$title = utf8_decode(strip_tags($XPath->query('//title')->item(0)->nodeValue));
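One plausible reading of why utf8_decode() helps here, assuming the parser read the UTF-8 bytes of the page as ISO-8859-1 and stored them as UTF-8 again (double encoding): converting back to ISO-8859-1 undoes that extra step and leaves the original UTF-8 bytes. The sketch below only simulates that round trip. The helper names are hypothetical, it assumes the mbstring extension is available, and it uses mb_convert_encoding() because utf8_decode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the original answer.
// Assumption: the mbstring extension is loaded.

// Simulate a parser that wrongly treats UTF-8 bytes as ISO-8859-1
// and then stores the result as UTF-8 (double encoding).
function misreadAsLatin1(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

// Undo the double encoding. For this purpose it plays the role of utf8_decode().
function undoLatin1Misread(string $doubleEncoded): string
{
    return mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
}

// UTF-8 bytes of the search term from the question's URL (%E9%83%AD%E5%AF%8C%E5%9F%8E).
$original = "\xE9\x83\xAD\xE5\xAF\x8C\xE5\x9F\x8E";
$garbled  = misreadAsLatin1($original);
$restored = undoLatin1Misread($garbled);

// Reports whether the restored bytes match the original UTF-8 bytes.
var_dump($restored === $original);
```

If the page was misread in some other way, for example delivered in a different encoding altogether, this reversal would not apply.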
Francois Deschenes
  • Hey francois! Thanks, it works now; I kept using utf8_encode previously. Anyway, thanks all for the great help and suggestions! – pakito Jun 27 '11 at 16:23
  • Sorry, but `utf8_decode` converts UTF-8 into ISO 8859-1, and the returned output definitely cannot be encoded in ISO 8859-1. – Gumbo Jun 27 '11 at 19:34
1

Google does some user agent sniffing to choose an appropriate output encoding. I’m not sure what user agent PHP’s DOMDocument uses and what the returned character encoding is, but you can force a particular output encoding by using the oe=utf-8 URL parameter.
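As a rough sketch of the URL tweak described above, the helper below appends oe=utf-8 to a search URL unless an oe parameter is already present. The function name is hypothetical, and it assumes the URL has no #fragment part.

```php
<?php
// Hypothetical helper for illustration: append oe=utf-8 to a URL's query string
// unless an "oe" parameter is already present.
// Assumption: the URL has no #fragment part.
function withUtf8OutputEncoding(string $url): string
{
    $query = parse_url($url, PHP_URL_QUERY);

    // Parse the existing parameters (an absent query yields an empty array).
    parse_str((string) $query, $params);
    if (isset($params['oe'])) {
        return $url; // Leave an explicit choice untouched.
    }

    // Use "?" when there is no query string yet, "&" otherwise.
    $separator = ($query === null || $query === false) ? '?' : '&';

    return $url . $separator . 'oe=utf-8';
}

echo withUtf8OutputEncoding('http://www.google.com.sg/search?hl=zh-CN&q=%E9%83%AD%E5%AF%8C%E5%9F%8E'), "\n";
```

The resulting URL can then be passed to loadHTMLFile() in place of the original one.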

Gumbo
  • I tried putting oe=utf-8 and the output is different but still garbled :( é­å¯å - Google æç´¢ – pakito Jun 27 '11 at 15:36
  • @pakito: Then you’re probably not [specifying your output encoding properly](http://www.w3.org/TR/html401/charset.html#spec-char-encoding) as UTF-8. – Gumbo Jun 27 '11 at 15:38
  • I have basically exhausted my options. I even tried using mb_convert_encoding( $title, 'utf-8', mb_detect_encoding($title, "ascii, cp1252, iso-8859-1, utf-8", true) ); I would appreciate it if you could advise me on how to output it properly. – pakito Jun 27 '11 at 15:54
1

Try setting utf-8 as your content type in PHP...

header ('Content-type: text/html; charset=utf-8');
fire
0
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Your output should declare the content encoding used by the source page (or you should explicitly convert the page to UTF-8).
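A minimal sketch of the "explicitly convert" option, assuming the source page's encoding is already known (for example from its Content-Type header or meta tag) and that the mbstring extension is available. The function name and its parameters are hypothetical. It converts the markup to UTF-8, then rewrites every non-ASCII character as a numeric character reference so the HTML parser cannot misjudge the byte encoding.

```php
<?php
// Hypothetical helper for illustration; assumes mbstring and the DOM extension.
// $sourceEncoding must be supplied by the caller; it is not detected here.
function extractTitle(string $html, string $sourceEncoding): ?string
{
    // Step 1: convert the raw page bytes to UTF-8.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);

    // Step 2: turn every non-ASCII character into a numeric entity,
    // leaving a pure-ASCII document for the parser.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Step 3: parse while collecting markup warnings instead of printing them.
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // DOM string values come back as UTF-8.
    $titles = $dom->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

// Example with an ISO-8859-1 page: "\xE9" is "é" in that encoding.
$page = "<html><head><title>Caf\xE9</title></head><body></body></html>";
var_dump(extractTitle($page, 'ISO-8859-1') === "Caf\xC3\xA9");
```

The returned title is UTF-8, so it matches a page that declares charset=utf-8 as in the meta tag above.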

symcbean