4

I'm working on a web crawler that grabs data from sites all over the world, so it has to deal with many different languages and encodings.

Currently I'm using the following function, and it works in 99% of cases, but the remaining 1% is giving me headaches.

function convertEncoding($str) {
    return iconv(mb_detect_encoding($str), "UTF-8", $str);
}
Paŭlo Ebermann
rafaschutz
  • Why are you using both iconv and mbstring? Use mb_convert_encoding if you want to use multibyte string extension. – Emre Yazici Jul 02 '11 at 21:57
  • I tried it... same result. Any ideas? – rafaschutz Jul 02 '11 at 22:07
  • possible duplicate of [PHP: Convert any string to UTF-8 without knowing the original character set, or at least try](http://stackoverflow.com/questions/7979567/php-convert-any-string-to-utf-8-without-knowing-the-original-character-set-or) – That Brazilian Guy Aug 22 '13 at 15:25

3 Answers

7

Rather than blindly trying to detect the encoding, you should first check if the page that you downloaded has a listed character set. The character set may be set in the HTTP response header, for example:

Content-Type:text/html; charset=utf-8

Or in the HTML as a meta tag, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 

Only if neither is available should you try to guess the encoding with mb_detect_encoding() or other methods.
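The order described above (HTTP header first, then the meta tag, then guessing) could be sketched roughly as follows. The function names `findDeclaredCharset` and `toUtf8`, the regular expressions, the 1024-byte scan window, and the candidate list passed to mb_detect_encoding() are all illustrative assumptions, not part of this answer.

```php
<?php
// Hypothetical helper: returns the charset a page declares, or null if none is found.
function findDeclaredCharset(?string $contentTypeHeader, string $html): ?string {
    // 1. HTTP response header value, e.g. "text/html; charset=utf-8"
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // 2. Meta tag near the top of the document (assumption: first 1024 bytes)
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', substr($html, 0, 1024), $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical wrapper: converts a downloaded page to UTF-8.
function toUtf8(string $html, ?string $contentTypeHeader = null): string {
    $from = findDeclaredCharset($contentTypeHeader, $html);
    if ($from === null) {
        // 3. Last resort: guess from a short candidate list, in strict mode
        $guess = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1'], true);
        $from = $guess !== false ? $guess : 'UTF-8';
    }
    if ($from === 'UTF-8') {
        return $html; // nothing to convert
    }
    // Note: a charset name that mbstring does not know raises an error in PHP 8.
    return mb_convert_encoding($html, 'UTF-8', $from);
}
```

A crawler would pass the Content-Type value it received along with the body, for example `toUtf8($body, 'text/html; charset=ISO-8859-1')`. A declared charset can itself be wrong, so this only reduces the guessing rather than eliminating it.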

sagi
  • Right... I manually set a "from" encoding based on the header of this source [link](http://www.youtube.com/watch?feature=player_embedded&v=cp_ajN7Kqvc). But I'm still getting a malformed string :( – rafaschutz Jul 02 '11 at 22:32
  • The source encoding from that YouTube page is UTF-8, so there's really nothing to convert here.. – sagi Jul 02 '11 at 22:35
  • Did some other tests... getting positive results when setting a "from" encoding :) ... thanks for the tip – rafaschutz Jul 02 '11 at 22:46
5

It is not possible to detect the character set of a string with 100% accuracy, since some character sets are subsets of others. Set the character set explicitly whenever possible, and avoid mixing iconv and mbstring functions. I recommend using a function like this and supplying the source charset whenever you know it:

function convertEncoding($str, $from = 'auto', $to = 'UTF-8') {
    // Guess only when the caller could not supply the source charset.
    if ($from === 'auto') {
        $from = mb_detect_encoding($str);
    }
    return mb_convert_encoding($str, $to, $from);
}
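To illustrate why detection cannot be fully reliable, here is a small sketch showing one byte that is valid in two single-byte encodings but stands for a different character in each. The choice of byte and of the two encodings is mine, picked only as an example.

```php
<?php
// The byte 0xA4 is legal in both ISO-8859-1 and ISO-8859-15,
// so nothing in the byte itself says which encoding was meant.
$byte = "\xA4";

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // U+00A4, currency sign
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15'); // U+20AC, euro sign

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac

// The same byte on its own is not valid UTF-8 at all.
var_dump(mb_check_encoding($byte, 'UTF-8')); // bool(false)
```

Both conversions succeed without any error, which is why supplying the real source charset matters more than the choice of conversion function.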
Emre Yazici
  • I tested it with your function, setting $from to the encoding given in the header of the source... same result :( – rafaschutz Jul 02 '11 at 22:23
  • I'm testing with the following source [link](http://www.youtube.com/watch?feature=player_embedded&v=cp_ajN7Kqvc) – rafaschutz Jul 02 '11 at 22:24
  • Dear rafaschutz, please read my answer carefully. I did not claim it will work for your situation. I explained why your way is not right and offered a better way. – Emre Yazici Jul 02 '11 at 22:27
  • I understand that... thanks for the tip – rafaschutz Jul 02 '11 at 22:42
  • Also, most character sets/encodings use the same bytes as other ones, but with other meanings (like most of the ISO-8859-x ones). Sometimes one can reliably guess depending on the distribution of characters (which depends on the language), but often you will need bigger amounts of text for that to work reliably. – Paŭlo Ebermann Jul 03 '11 at 03:33
1

You can try utf8_encode($str), which converts a string from ISO-8859-1 to UTF-8.

http://www.php.net/manual/en/function.utf8-encode.php#89789
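One caveat, shown as a rough sketch: utf8_encode() (the function the linked manual page documents) always treats its input as ISO-8859-1, so running it on text that is already UTF-8 encodes it a second time and garbles it. The function is also deprecated as of PHP 8.2, so the sketch uses mb_convert_encoding() with an explicit 'ISO-8859-1' source instead. The helper name `latin1ToUtf8Once` is hypothetical.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 to UTF-8 only if the input
// is not already valid UTF-8.
function latin1ToUtf8Once(string $str): string {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already UTF-8 (or plain ASCII): leave untouched
    }
    // Same byte-to-code-point mapping that utf8_encode() performs.
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

// "café" in ISO-8859-1 is converted once:
echo bin2hex(latin1ToUtf8Once("caf\xE9")), "\n";     // 636166c3a9
// "café" already in UTF-8 is left alone:
echo bin2hex(latin1ToUtf8Once("caf\xC3\xA9")), "\n"; // 636166c3a9
// Converting the UTF-8 form unconditionally double-encodes it:
echo bin2hex(mb_convert_encoding("caf\xC3\xA9", 'UTF-8', 'ISO-8859-1')), "\n"; // 636166c383c2a9
```

The validity check is a heuristic: a short ISO-8859-1 string can, rarely, also be valid UTF-8, in which case it would be passed through unchanged.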

Or you can replace the content type meta tag in the head of the crawled content with

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Kulin Choksi