I am using PHP to scrape some data from a web page, and somehow I am pulling in unexpected characters that are not present in the source document. I assume this is because I am interpreting the page with the wrong character encoding, but I am unsure how to resolve the issue.
Here's a sample piece of the HTML that gives me the error:
<tr>
<td>Aug 2013</td>
<td>TEDxColbyCollege</td>
<td>
<a href="/talks/daniel_h_cohen_for_argument_s_sake.html">Daniel H. Cohen: For argument’s sake</a> </td>
.
.
.
// more of the table
Now the resulting string that I echo / store in the db looks like this: Daniel H. Cohen: For argumentâÂÂs sake
I am using the following code to load the HTML document and scrape it:
$html = file_get_contents('url_of_html_page_being_scraped');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
$table = $sxml->xpath('//table'); // xpath() returns an array of matching elements
foreach ($table[0]->tr as $vid)
{
.
.
echo $vid->td[2]->a; // line giving me the problem
.
.
}
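For anyone who wants to try this without hitting the live page, here is a minimal self-contained sketch of the same steps. It assumes a one-row table built from the sample HTML above; the helper name scrape_titles and the '//table//tr' XPath are hypothetical choices for the sketch, not part of the original scraper. It prints the scraped text plus its raw bytes as hex, since the bytes show what actually came out of the parser regardless of how the terminal or browser renders them. What it prints may well depend on the PHP / libxml version in use.

```php
<?php
// Hypothetical helper (not part of the original snippet): returns the link
// text found in the third cell of every table row in the given HTML.
function scrape_titles(string $html): array
{
    libxml_use_internal_errors(true); // keep HTML parse warnings off the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $sxml = simplexml_import_dom($doc);
    $titles = [];
    foreach ($sxml->xpath('//table//tr') as $row) {
        $titles[] = (string) $row->td[2]->a;
    }
    return $titles;
}

// Inline stand-in for the scraped page; \u{2019} is the curly apostrophe.
$html = <<<HTML
<!doctype html>
<html lang="en">
<head><meta charset="utf-8"></head>
<body><table>
<tr>
<td>Aug 2013</td>
<td>TEDxColbyCollege</td>
<td><a href="/talks/daniel_h_cohen_for_argument_s_sake.html">Daniel H. Cohen: For argument\u{2019}s sake</a></td>
</tr>
</table></body>
</html>
HTML;

foreach (scrape_titles($html) as $title) {
    echo $title, "\n";
    echo bin2hex($title), "\n"; // raw bytes of the scraped string
}
```

In the hex dump, a correctly preserved apostrophe would show up as the byte sequence e28099; anything longer at that position suggests the bytes were re-encoded somewhere along the way.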
The head of the document indicates:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
.
.
</head>
So I am assuming my method is not interpreting the charset correctly, though I am unsure how I could specify it, or whether that is even the problem. It also appears that the error occurs only on the apostrophe in "argument’s".
Any insight into what's going on and how I can fix it would be awesome.
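To make the charset suspicion concrete, here is a minimal sketch of how this kind of garbling can arise. It assumes (this is a guess, not something confirmed for the scraper) that the UTF-8 bytes of the curly apostrophe get read as ISO-8859-1 at some step and are then written back out as UTF-8. The function name misread_as_latin1 is hypothetical, and the sketch needs the mbstring extension.

```php
<?php
// Hypothetical helper: simulates a consumer that wrongly treats UTF-8 bytes
// as ISO-8859-1 and then re-encodes them as UTF-8.
function misread_as_latin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$quote = "\xE2\x80\x99"; // U+2019 RIGHT SINGLE QUOTATION MARK in UTF-8

echo bin2hex($quote), "\n";                    // e28099
$garbled = misread_as_latin1($quote);
echo bin2hex($garbled), "\n";                  // c3a2c280c299

// The damage is reversible as long as it happened exactly once.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n";                 // e28099
```

The three original bytes become six: an "â" (c3a2) followed by two invisible control characters, which would be roughly in line with an output that starts with "â" where the apostrophe should be.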
Update: After some recommendations from @Patrick Manser, I have attempted the solutions found elsewhere on S.O.,
mainly:
$html = stripslashes(mb_convert_encoding(file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8"));
// AND
$html = mb_convert_encoding(file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8");
Both result in the output appearing like so: Daniel H. Cohen: For argument’s sake