I am scraping a web page (using curl) and trying to retrieve its JSON-LD content.
First I fetch the content of the page:
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$page = curl_exec($handle);
curl_close($handle);
This works fine.
I checked the contents of $page in a hex editor and the page is correctly encoded as UTF-8. For example, the characters "ół" are encoded as "C3 B3 C5 82", which is correct.
The problem starts when I query for the JSON-LD scripts:
$dom = new DOMDocument();
@$dom->loadHTML($page);
$xpath = new DOMXpath($dom);
$jsonScripts = $xpath->query( '//script[@type="application/ld+json"]' );
and then
foreach ($jsonScripts as $jScript)
{
    // Raw text content of the <script type="application/ld+json"> element
    $json = $jScript->nodeValue;
    $data = json_decode($json, true);
}
Suddenly the same characters are encoded as "C3 83 C2 B3 C3 85 C2 82".
What just happened?
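For reference, the second byte sequence is exactly what you get when each UTF-8 byte of "ół" is read as a single ISO-8859-1 character and then encoded as UTF-8 again (double encoding). The following is a minimal sketch that reproduces the pattern in isolation, assuming the mbstring extension is available; hexBytes is a hypothetical helper introduced only for this illustration, and the sketch does not involve DOMDocument at all:

```php
<?php
// Hypothetical helper: format a byte string as space-separated uppercase hex pairs.
function hexBytes(string $bytes): string
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

// "ół" as raw UTF-8 bytes, as seen in the hex editor.
$original = "\xC3\xB3\xC5\x82";

// Interpret every byte as one ISO-8859-1 character and re-encode it as UTF-8.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo hexBytes($original), "\n";      // C3 B3 C5 82
echo hexBytes($doubleEncoded), "\n"; // C3 83 C2 B3 C3 85 C2 82
```

This suggests that somewhere between curl_exec and nodeValue the page is being treated as ISO-8859-1 rather than UTF-8, most likely inside loadHTML when it finds no charset information it recognizes. If that is the cause, giving loadHTML an explicit encoding hint would be one thing to try, though that is an assumption and not something verified against this page.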