2

I am using the PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) to fetch data such as the page title, meta description and meta tags from other domains and then insert it into a database.

But I have some issues with encoding. The problem is that I do not get the correct characters from websites that are not in English.

Below is the code:

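<?php
// Project bootstrap; presumably defines the curl wrapper and loads simple_html_dom
require 'init.php';

$curl = new curl();
$html = new simple_html_dom();

// URL of the page to fetch, passed as ?page=...
$page = isset($_GET['page']) ? $_GET['page'] : '';

$curl_output = $curl->getPage($page);

$html->load($curl_output['content']);

// find() returns null when the page has no <title> element
$title_node = $html->find('title', 0);
$meta_title = $title_node ? $title_node->innertext : '';

print $meta_title . "<hr />";

// print $html->plaintext . "<hr />";
?>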

Output for the facebook.com page

Welcome to Facebook — Log in, sign up or learn more

Output for the amazon.cn page

亚马逊-网上购物商城:è¦ç½‘è´­, å°±æ¥Z.cn!

Output for the mail.ru page

Mail.Ru: почта, поиÑк в интернете, новоÑти, игры, развлечениÑ

So the characters are not being encoded properly.

Can anyone help me solve this issue so that I can add the correct data to my database?

j0k
Prakash

3 Answers

10

@deceze and @Shakti thanks for your help.

+1 for the article link posted by deceze (Handling Unicode Front to Back in a Web App). Understanding encoding is also worth reading.

After reading your comments, answer and of course those two articles, I finally solved my issue.

Here are the steps I have taken so far to solve this issue:

  1. Added header('Content-Type: text/html; charset=utf-8'); at the top of my init.php file.
  2. Changed the CHARACTER SET of the database table field that stores those values to UTF-8.
  3. Set the MySQL connection charset to UTF-8: mysql_set_charset('utf8', $connection_link_id);
  4. Used the htmlentities() function to convert characters: $meta_title = htmlentities(trim($meta_title_raw), ENT_QUOTES, 'UTF-8');
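
As a rough sketch, steps 1, 3 and 4 might look like the following. It swaps in mysqli for the older mysql_* functions, which is an assumption about the environment and not what the steps above used. use_utf8_connection() and clean_title() are hypothetical helper names, and $db is assumed to be a connection that is already open. Step 2 is a change to the table definition and is not shown.

```php
<?php
// Step 1: declare the output as UTF-8 before anything is printed.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Step 3: set the connection charset. Uses mysqli (an assumption) instead of
// mysql_set_charset(); $db must be an already-open connection.
function use_utf8_connection(mysqli $db)
{
    return $db->set_charset('utf8');
}

// Step 4: trim the raw title and convert special characters to HTML entities.
function clean_title($meta_title_raw)
{
    return htmlentities(trim($meta_title_raw), ENT_QUOTES, 'UTF-8');
}

// "\xC3\xA9" is the UTF-8 byte sequence for "é".
echo clean_title("  Caf\xC3\xA9 & \"Bar\"  "), "\n";
```

Note that MySQL's utf8 charset stores at most three bytes per character; utf8mb4 is the variant to use if four-byte characters are expected.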

Now the issue seems to be solved, but I still have to do the following to solve it completely:

  1. Get the charset of the source page, $source_charset.
  2. Convert the string to UTF-8 if it is not already in that encoding. PHP's iconv() function can do this (mb_convert_encoding() is an alternative). Example: iconv($source_charset, "UTF-8", $meta_title_raw);
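
The conversion step could be sketched like this. to_utf8() is a hypothetical helper name, and the sketch assumes $source_charset has already been determined and that the iconv extension is available. Falling back to the unconverted string on failure is just one possible choice.

```php
<?php
// Convert $string from $source_charset to UTF-8.
// Hypothetical helper; not part of PHP or Simple HTML DOM.
function to_utf8($string, $source_charset)
{
    // Nothing to do if the source already declares UTF-8.
    if (strcasecmp($source_charset, 'UTF-8') === 0) {
        return $string;
    }

    $converted = iconv($source_charset, 'UTF-8', $string);

    // iconv() returns false on failure; keep the original string in that case.
    return $converted === false ? $string : $converted;
}

// "\xE9" is "é" in ISO-8859-1; the result is the two-byte UTF-8 form.
echo to_utf8("Caf\xE9", 'ISO-8859-1'), "\n";
```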

To get $source_charset I will probably have to combine several checks, such as inspecting the HTTP headers and the meta tag. I found a good answer at Detect encoding.
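
One way such a multi-step check might look is sketched below. detect_charset() is a hypothetical helper: it takes the value of the Content-Type response header (how to obtain it depends on the curl wrapper in use) and the raw HTML, and falls back to a default guess. It is a simple regex-based approach and will not cover every page.

```php
<?php
// Guess the charset of a fetched page. Hypothetical helper.
// $content_type: value of the Content-Type response header, or '' if unknown.
// $html: raw page source. $default: returned when nothing is declared.
function detect_charset($content_type, $html, $default = 'UTF-8')
{
    // 1. Prefer the charset declared in the HTTP header.
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $content_type, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Otherwise look at the meta tag. This matches both
    //    <meta charset="..."> and the older http-equiv form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared: fall back to the default guess.
    return $default;
}

echo detect_charset('text/html; charset=windows-1251', ''), "\n";
```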

Let me know if there are any improvements to, or faults in, the steps above.

Community
Prakash
2

If I switch browser encoding to UTF-8, it works.

So you're simply not setting the correct HTTP header to designate your document to be UTF-8 encoded and the browser is interpreting it in some other encoding. Use:

header('Content-Type: text/html; charset=utf-8');
deceze
  • The PHP code listed above is only for testing, and it works if I add the content type header. My real code will add the info (the value of `$meta_title`) to the database, and then another page will retrieve those values from the database. On that page it does not work even if I set it to UTF-8. – Prakash Sep 10 '12 at 12:47
  • @Prakash: You must ensure that the current database connection is set to accept UTF-8 data. Run the query `SET NAMES utf8` before sending any other query to the database, and also make sure that your database, table and columns are set to a UTF-8 encoding. Then setting the UTF-8 header on your other page should work. – Shakti Singh Sep 10 '12 at 12:50
  • @Prakash Then I recommend you read [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/) – deceze Sep 10 '12 at 12:51
0

I had the same problem with Romanian characters. Nothing worked until I used

header('Content-Type: text/html; charset=ISO-8859-2'); 

ISO-8859-2 (Latin-2) is the character set for Central and Eastern European languages. So find the right character set for your language and use it in the header.

Robert
Silviu