I'm trying to get Thai characters from a website. I've tried:

$rawChapter = file_get_contents("URL");
$rawChapter = mb_convert_encoding($rawChapter, 'UTF-8', mb_detect_encoding($rawChapter, 'UTF-8, ISO-8859-1', true));

When I do this, the characters come back like:

¡ÅѺ˹éÒáá¾ÃФÑÁÀÕÃìÀÒÉÒä·Â©ºÑº

But if I take the source of the page I'm trying to load and save it into my own .htm file on my localhost as a UTF-8 file, it loads the Thai characters correctly. It only breaks when I try to load it from the site directly.

How can I fix this? What could be the problem?

I've also tried adding this context:

$context = stream_context_create(array(
            'http' => array(
                'method' => 'POST',
                'header' => implode("\r\n", array(
                    'Content-type: application/x-www-form-urlencoded',
                    'Accept-Language: en-us,en;q=0.5',
                    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7'
                ))
            )
        ));

I've tried adding it alone, and I've tried adding it together with mb_convert_encoding()... I feel like I've tried every combination of this stuff with no success.

Tyler

1 Answer

Change your Accept-Charset to UTF-8, because ISO-8859-1 does not support Thai characters. If you are running your PHP script on a Windows machine, you may also use the windows-874 charset. You may also try adding this header:

Content-Language: th

But in most cases, UTF-8 handles nearly all characters and character sets without any other declaration.
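
To see why the conversion in the question produces garbage, here is a minimal sketch. It assumes the page is actually served as TIS-620 (which the update in this answer suggests) and that your PHP build has both the mbstring and iconv extensions, with an iconv library that knows the name `TIS-620`. The four sample bytes are an illustrative assumption, chosen to match the start of the garbled output in the question.

```php
<?php
// Four bytes that spell the Thai word "กลับ" in TIS-620 (illustrative sample).
$bytes = "\xA1\xC5\xD1\xBA";

// The bytes are not valid UTF-8, so strict detection falls through to
// ISO-8859-1, which accepts any byte sequence.
$guess = mb_detect_encoding($bytes, 'UTF-8, ISO-8859-1', true);

// Treating the bytes as Latin-1 turns each one into an accented Latin
// character instead of a Thai one.
$wrong = mb_convert_encoding($bytes, 'UTF-8', $guess);

// Naming the real source charset recovers the Thai text.
$right = iconv('TIS-620', 'UTF-8', $bytes);

echo $guess, "\n", $wrong, "\n", $right, "\n";
```

On a typical build this should print `ISO-8859-1`, then `¡ÅÑº`, then `กลับ`. The detection step cannot help here because TIS-620 is not in the candidate list, and ISO-8859-1 never rejects anything.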

** UPDATE **

Very strange, but this works for me.

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=> implode("\r\n", array(
                   'Content-type: text/plain; charset=TIS-620'
                   //'Content-type: text/plain; charset=windows-874'  // same thing
                ))
  )
);

$context = stream_context_create($opts);

//$fp = fopen('http://thaipope.org/webbible/01_002.htm', 'rb', false, $context);
//$contents = stream_get_contents($fp);
//fclose($fp);
$contents = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context);

header('Content-type: text/html; charset=TIS-620');
//header('Content-type: text/html; charset=windows-874');  // same thing

echo $contents;

Apparently, I was wrong about UTF-8 for this one. See here for more details. You can still produce UTF-8 output, though:

$in_charset = 'TIS-620';   // windows-874 is a near-identical superset
$out_charset = 'utf-8';

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=> implode("\r\n", array(
                   'Content-type: text/plain; charset=' . $in_charset
                ))
  )
);

$context = stream_context_create($opts);

$contents = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context);
if ($in_charset != $out_charset) {
    $contents = iconv($in_charset, $out_charset, $contents);
}

header('Content-type: text/html; charset=' . $out_charset);

echo $contents;   // output in UTF-8
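
If you would rather not hard-code the input charset, one option is to read it from what the page declares. The following is only a sketch under assumptions: `declaredCharset()` and `toUtf8()` are hypothetical helper names, the regular expressions are deliberately loose, and the `TIS-620` fallback is just a guess suited to this particular site.

```php
<?php
// Hypothetical helper: find the charset a page declares, checking the HTTP
// response headers first and then a <meta> tag in the markup.
function declaredCharset(array $headers, string $html): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=["\']?([\w.:-]+)/i', $line, $m)) {
            return strtoupper($m[1]);
        }
    }
    if (preg_match('/<meta[^>]+charset=["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert a fetched page to UTF-8, using an assumed
// fallback charset when the page declares none.
function toUtf8(string $html, array $headers, string $fallback = 'TIS-620'): string
{
    $charset = declaredCharset($headers, $html) ?? $fallback;
    if ($charset === 'UTF-8') {
        return $html;
    }
    // iconv() returns false on an unknown charset or invalid input;
    // in that case the original bytes are returned unchanged.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}
```

After a `file_get_contents()` call over HTTP, the response headers are usually available in the local `$http_response_header` variable, so a call might look like `toUtf8($contents, $http_response_header ?? [])`.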
Yanick Rochon
  • Ok, I took out ISO and kept the utf-8 part and added that header and put it in: – Tyler Feb 07 '11 at 02:55
  • whoa, I don't know how to use comments... I got this: $rawChapter = file_get_contents("http://thaipope.org/webbible/01_002.htm",false, $context); and it returned: ��Ѻ˹���á��Ф�����������©�Ѻ – Tyler Feb 07 '11 at 02:56
  • Yes, you got your string all right. The string itself is fine (it contains Thai characters); the problem is that you echo it as ISO-8859-1. If your output is HTML, use `header('Content-type: text/html; charset=utf-8');`. If your output is plain text, use `header('Content-type: text/plain; charset=utf-8');` – Yanick Rochon Feb 07 '11 at 02:58
  • I tried loading another Thai site and it echoes properly. Likewise, if I copy the source of that Thai site and put it on my localhost, it echoes properly too. I'll try adding the things you said. – Tyler Feb 07 '11 at 03:05
  • I would suggest you always work with UTF-8; that is, save your PHP source file as UTF-8 and always use `charset=utf-8`. That way, you avoid encoding and character gibberish problems. – Yanick Rochon Feb 07 '11 at 03:08
  • Wow, thanks, yeah that works. What exactly is that header doing at the end? Is that changing the PHP page's meta tags? – Tyler Feb 07 '11 at 04:33
  • It specifies the encoding in which the server sends the page or data. For best results, set the charset to the encoding you save your file in. Consequently, you should always save your files as UTF-8 :) – Yanick Rochon Feb 07 '11 at 04:47
  • Thanks, I'm taking out the Thai text and saving it in my db in utf8_unicode_ci and it seems to be working great. Really appreciate the help, that was hours of no-fun-ness :( – Tyler Feb 07 '11 at 04:59
  • In that case, the last snippet is what you really needed. Glad I could help. – Yanick Rochon Feb 07 '11 at 05:02