Scraping pages with PHP results in unexpected characters

Question

I am using PHP to scrape some data from a web page, and the result somehow contains unexpected characters that are not present in the source document. I assume this is because I am interpreting the page with the wrong character encoding, but I am unsure how to resolve the issue.

Here is a sample of the HTML that triggers the problem:

<tr>
    <td>Aug 2013</td>
    <td>TEDxColbyCollege</td>
    <td>
        <a href="/talks/daniel_h_cohen_for_argument_s_sake.html">Daniel H. Cohen: For argument’s sake</a>       </td>
   . 
   . 
   . 
// more of the table

Now the resulting string I echo / store in db looks like this: Daniel H. Cohen: For argumentÃ¢ÂÂs sake

I am using the following code to load the HTML document and scrape

$html = file_get_contents('url_of_html_page_being_scrapped');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
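$table = $sxml->xpath('//table');   // xpath() returns an array of matching elements
foreach ($table[0]->tr as $vid)     // iterate over the rows of the first table
{
 .
 .
 echo $vid->td[2]->a;  // line giving me the problem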
 .
 .
}

The head of the document indicates

 <!doctype html>
 <html lang="en">
 <head>
 <meta charset="utf-8">
 .
 .
 </head>

So I assume my method is not interpreting the charset correctly, though I am unsure how to specify it, or whether that is even the problem. The error also appears to occur only on the apostrophe (’). Any insight into what is going on and how I can fix it would be awesome.

Update: After some recommendations from @Patrick Manser, I have tried the solutions found elsewhere on Stack Overflow,

mainly:

 $html =stripslashes(mb_convert_encoding( file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8" ));
 //AND
 $html = mb_convert_encoding( file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8" );

Both result in output like this: Daniel H. Cohen: For argumentâ€™s sake

`$html = file_get_contents('url_of_html_page_being_scrapped');` is that the page, where you put the ``? — Patrick Manser, Aug 06 '13 at 11:21
No, I did not put anything there; the head of the document at `url_of_html_page_being_scrapped` appears as indicated above as ` . . ` — brendosthoughts, Aug 06 '13 at 11:23
That's what I actually meant :) Well, I don't know if this will work for you, but I had similar problems, and wrapping the content being loaded in utf8_encode() did the trick. I don't know if this is more of an improper hack... But try it: `$doc->loadHTML(utf8_encode($html));` — Patrick Manser, Aug 06 '13 at 11:25
Hey, thanks for the idea, but no luck; I still get the same result — brendosthoughts, Aug 06 '13 at 11:27
http://www.php.net/manual/en/domdocument.loadhtml.php#95251 does this help? — Patrick Manser, Aug 06 '13 at 11:30
or this http://stackoverflow.com/questions/2292004/getting-a-instead-of-an-apostrophe-in-php — Patrick Manser, Aug 06 '13 at 11:31
@PatrickManser hey, thanks ... It looks better and nearly manageable. Using both of these methods, the output looks like this: `Daniel H. Cohen: For argumentâs sake` ... I am now trying to use str_replace on the â; however, working through a terminal, I can't seem to produce the character in the script :s — brendosthoughts, Aug 06 '13 at 11:52

score 1 · Accepted Answer · answered Aug 06 '13 at 12:19

Although the text still looks garbled when echoed and in my database table, adding this line to the head of the HTML document that displays the data makes the ’ render properly:

 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
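The meta tag above fixes how the browser displays the data. On the scraping side, a commonly used workaround is to tell DOMDocument the encoding explicitly, since the PHP manual notes that loadHTML() treats input as ISO-8859-1 unless an encoding is declared in a form it recognizes (whether an HTML5 `<meta charset>` is recognized depends on the libxml2 version). The following is a minimal sketch under those assumptions, not the asker's actual code: the sample markup is a shortened stand-in for the real page, and `loadUtf8Html` is a hypothetical helper name.

```php
<?php
// Stand-in for the scraped page; the real HTML would come from file_get_contents().
$html = '<!doctype html><html lang="en"><head><meta charset="utf-8"></head><body>'
      . '<table><tr><td>Aug 2013</td><td>TEDxColbyCollege</td>'
      . '<td><a href="/talks/daniel_h_cohen_for_argument_s_sake.html">'
      . "Daniel H. Cohen: For argument\u{2019}s sake</a></td></tr></table>"
      . '</body></html>';

// Hypothetical helper: parse a UTF-8 HTML string without relying on
// DOMDocument detecting the charset on its own.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    // The XML encoding declaration makes the parser read the bytes as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    return $doc;
}

$doc = loadUtf8Html($html);
$xpath = new DOMXPath($doc);

// Collect the link text from the third cell of every table row.
$titles = [];
foreach ($xpath->query('//table//tr/td[3]/a') as $link) {
    $titles[] = trim($link->textContent);
}

echo implode(PHP_EOL, $titles), PHP_EOL;
```

Text read back from the DOM this way is UTF-8, so the terminal, the database connection, and the page that finally displays it all need to be set to UTF-8 as well.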

score 1 · Answer 2 · edited Aug 21 '13 at 23:51

Even with proper use of htmlspecialchars_decode(), html_entity_decode(), and mb_convert_encoding(), this problem is pretty difficult to get rid of.

I use a modified version of Sebastián Grignoli's forceUTF8() function to fully clean up strings. I know of nothing else like it for PHP.

You can find one version of the function on GitHub (neitanod/forceutf8).

If you really need a full clean-up regardless of characters involved, this gives amazing results.

The following are examples from the readme.

An example usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÃ©dÃÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÃÃ©dÃÃÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÃÃÃ©dÃÃÃÃ©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
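For readers who want to see the idea without installing the library, here is a rough heuristic sketch of undoing repeated mis-encoding. It is not the forceutf8 implementation: `undoDoubleEncoding` is a hypothetical name, and the sketch assumes the damage came from UTF-8 bytes being misread as Windows-1252 one or more times. It will also "repair" a string that legitimately contains a sequence such as Ã©, so treat it as an illustration only.

```php
<?php
// Heuristic sketch: peel off layers of "UTF-8 misread as Windows-1252".
function undoDoubleEncoding(string $text, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        // Map each character back to the single Windows-1252 byte it was misread from.
        $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

        // Stop when nothing changed (plain ASCII), when the resulting bytes are
        // not valid UTF-8 (the text was already correct), or when the step is
        // not reversible (some character has no Windows-1252 byte).
        if ($candidate === $text
            || !mb_check_encoding($candidate, 'UTF-8')
            || mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $text) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

echo undoDoubleEncoding("FÃ©dÃ©ration Camerounaise de Football"), PHP_EOL;
// prints: Fédération Camerounaise de Football
```

The round-trip check is the main design choice here: a pass is accepted only if re-encoding the result reproduces the input exactly, which keeps already-correct text such as Japanese or Cyrillic from being turned into question marks.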

EDIT

Also, note that if you are using a web-based DB browser (like phpMyAdmin) you may encounter character discrepancies between the character encoding stored in the DB and the encoding defined by the web-page. I've had cases where what's stored in the DB is entirely correct, but it just looks wrong from the interface.
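To illustrate that kind of discrepancy, here is a small sketch showing how correctly stored UTF-8 bytes look wrong when a viewer reads them as Windows-1252. It assumes the mbstring extension is available; the variable names are only for illustration.

```php
<?php
$correct = "argument\u{2019}s";           // U+2019 is stored as the UTF-8 bytes E2 80 99
echo bin2hex("\u{2019}"), PHP_EOL;        // e28099

// Simulate a viewer that reads those same bytes as Windows-1252.
$asSeen = mb_convert_encoding($correct, 'UTF-8', 'Windows-1252');
echo $asSeen, PHP_EOL;                    // argumentâ€™s

// The stored bytes never changed; mapping the display back recovers them.
$restored = mb_convert_encoding($asSeen, 'Windows-1252', 'UTF-8');
var_dump($restored === $correct);         // bool(true)
```

This is the same `â€™` sequence reported in the question, which suggests checking the encoding of the viewer or connection before rewriting the stored data.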

Thanks for the advice. I tried it out and I'm still not getting properly encoded strings returned. It appears that there is already an issue opened on the project for it, and I will keep an eye on it for possible use in the future! — brendosthoughts, Aug 07 '13 at 14:07
Glad to help! Also, if the open issue in question is the [non-breaking space issue](https://github.com/neitanod/forceutf8/issues/9), I seem to remember using a [unicode preg_replace](http://www.php.net/manual/en/regexp.reference.unicode.php) to convert those characters to something manageable (i.e. `preg_replace('/\p{Zs}/u', ' ', $htmlString)`; the `u` modifier makes the pattern treat the string as UTF-8). Though this seems strange if it's your problem. — David, Aug 07 '13 at 15:01
