
For a small project in WordPress, I am trying to scrape some information from a site given a URL (namely a thumbnail and the publisher). I know there are a few plugins doing similar things, but they usually inject the result into the article itself, which is not my goal. Furthermore, the one I use tends to have the same issue I have.

My overall goal is to display a thumbnail and the publisher name, given a URL stored in a post custom field. For the moment I get my data from the Open Graph meta tags (I'm a lazy guy).

The overall code works, but I get the usual mangled text when dealing with non-Latin characters (and that's 105% of my cases). Even stranger for me: it depends on the site.

I have tried ForceUTF8 and enabling gzip compression in curl, as recommended in various answers here, but the result is still the same (or gets worse).
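
Roughly, those attempts looked like the sketch below (the ForceUTF8 call assumes the neitanod/forceutf8 Composer package; the exact options may have differed):

    // Rough sketch of the attempted fixes (not my exact code):
    curl_setopt($ch, CURLOPT_ENCODING, '');      // let curl handle gzip/deflate decompression itself
    $data = curl_exec($ch);
    $data = \ForceUTF8\Encoding::fixUTF8($data); // try to repair broken UTF-8 (neitanod/forceutf8)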

My only clue for the moment is how the encoding is declared on each page.

For example, here are 3 URLs I was given:

https://www.jomo-news.co.jp/life/oricon/25919
    <meta charset="UTF-8" />
    <meta property="og:site_name" content="上毛新聞" />

Result > ä¸Šæ¯›æ–°è ž

Not OK

https://entabe.jp/21552/rl-waffle-chocolat-corocoro
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta property="og:site_name" content="えん食べ [グルメニュース]" />

Result > えん食べ [グルメニュース]

OK

https://prtimes.jp/main/html/rd/p/000000008.000026619.html
    <meta charset="utf-8">
    <meta property="og:site_name" content="プレスリリース・ニュースリリース配信シェアNo.1|PR TIMES" />

Result > ãƒ—ãƒ¬ã‚¹ãƒªãƒªãƒ¼ã‚¹ãƒ»ãƒ‹ãƒ¥ãƒ¼ã‚¹ãƒªãƒªãƒ¼ã‚¹é… ä¿¡ã‚·ã‚§ã‚¢No.1|PR TIMES

Not OK

For reference, here is the curl function I use:

    function file_get_contents_curl($url)
    {
        header('Content-type: text/html; charset=UTF-8');

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);         // body only, no response headers
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects

        $data = curl_exec($ch);
        curl_close($ch);

        return $data;
    }

And the scraping function:

    function get_news_header_info($url)
    {
        // Parsing begins here:
        $news_result = array("news_img_url" => "", "news_name" => "");
        $news_name = ""; // holds the og:site_name value once found

        $html = file_get_contents_curl($url);
        $doc  = new DOMDocument();
        @$doc->loadHTML($html);

        $metas = $doc->getElementsByTagName('meta');

        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('property') == 'og:site_name') {
                if (!$news_name) {
                    $news_name = $meta->getAttribute('content');
                }
            }
        // Script continues
    }
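
One side note on the parsing step (this is an assumption about a possible contributing factor, not something from my original code): DOMDocument::loadHTML() falls back to ISO-8859-1 when it does not recognise an encoding declaration in the markup, and a commonly used workaround is to hint the encoding before parsing:

    // Hedged sketch: hint UTF-8 to DOMDocument by prepending an XML
    // declaration before loadHTML() parses the fetched markup.
    $doc = new DOMDocument();
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);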

Does anyone know what is different between these three cases, and how I could deal with it?

EDIT

It looks like, even though all the websites declare a UTF-8 charset, after looking at curl_getinfo() and testing a bunch of charset conversion combinations, a conversion to ISO-8859-1 was necessary.

So just adding a

iconv("UTF-8", "ISO-8859-1", $scraped_text);

was enough to solve the problem.

For the sake of giving a complete answer, here is the snippet of code to test conversion pairs, taken from this answer by rid-iculous:

    $charsets = array(
        "UTF-8",
        "ASCII",
        "Windows-1252",
        "ISO-8859-15",
        "ISO-8859-1",
        "ISO-8859-6",
        "CP1256"
    );

    foreach ($charsets as $ch1) {
        foreach ($charsets as $ch2) {
            echo "<h1>Combination $ch1 to $ch2 produces: </h1>" . iconv($ch1, $ch2, $text_2_convert);
        }
    }
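
If you want to avoid applying the conversion blindly, a small guard like the sketch below only converts when the round trip still yields valid UTF-8 (the helper name and the mb_check_encoding test are my own addition, not part of the tested snippet):

    // Sketch of a guard around the fix: convert only when the round trip
    // yields valid UTF-8, otherwise return the text untouched.
    function maybe_undo_double_utf8($text)
    {
        $converted = @iconv("UTF-8", "ISO-8859-1", $text);
        if ($converted !== false && mb_check_encoding($converted, "UTF-8")) {
            return $converted;
        }
        return $text;
    }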

Problem solved, have fun!

1 Answer


It looks like, even though all the pages declared UTF-8, some ISO-8859-1 was hiding in places. Using iconv solved the issue.

I edited the question with all the details. Case closed!