I scrape a web page (using cURL) and try to retrieve its JSON-LD content.

So first I get the content of the page:

  $handle = curl_init();
  curl_setopt($handle, CURLOPT_URL, $url);
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);

  $page = curl_exec($handle);
  curl_close($handle);

and it works ok.

I checked the content of $page in a hex editor and saw that the page is correctly encoded as UTF-8. For example, the characters "ół" are encoded as "C3 B3 C5 82", which is correct.

The problem starts when I query for the JSON-LD scripts:

  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXPath($dom);
  $jsonScripts = $xpath->query( '//script[@type="application/ld+json"]' );

and then

  foreach ($jsonScripts as $jScript)
  {
      // Raw text of the <script type="application/ld+json"> element
      $json = $jScript->nodeValue;
      $data = json_decode($json, true);
  }

Suddenly the same characters are encoded as "C3 83 C2 B3 C3 85 C2 82".

What just happened?

Pepe
  • See [this](https://stackoverflow.com/a/8218649/231316), which says that DOMDocument works in ISO-8859-1 by default and you need to kick it into UTF-8 mode. It is also possible that the site you are loading this from declares its charset in the HTTP header and not in the HTML. – Chris Haas Sep 19 '21 at 14:49
  • @ChrisHaas - thank you 100 times. Indeed, there was a problem with the document. The character set was defined as <meta charset=UTF-8>, not <meta charset="UTF-8">. – Pepe Sep 19 '21 at 15:14
  • I’m glad that worked! Instead of editing the question with your answer, please roll that back and instead post it as an answer and accept that. – Chris Haas Sep 19 '21 at 15:49

1 Answer

SOLVED

The problem was in the scraped page. The character set was defined as

<meta charset=UTF-8>

not

<meta charset="UTF-8">
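The doubled bytes from the question are consistent with UTF-8 input being read as ISO-8859-1 and then written back out as UTF-8, which is what the linked comment suggests DOMDocument does when it does not pick up the charset. Here is a minimal sketch that reproduces the byte pattern outside of DOMDocument. The helper name `misreadAsLatin1` is hypothetical, and the sketch assumes the mbstring extension is available:

```php
<?php
// Hypothetical helper (not part of the original code): treat each input byte
// as one ISO-8859-1 character and re-encode the result as UTF-8.
// Assumes the mbstring extension is loaded.
function misreadAsLatin1(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

$original = "\xC3\xB3\xC5\x82";          // "ół" encoded as UTF-8
$mangled  = misreadAsLatin1($original);  // every byte becomes a 2-byte sequence

echo strtoupper(bin2hex($original)), "\n"; // C3B3C582
echo strtoupper(bin2hex($mangled)), "\n";  // C383C2B3C385C282
```

Each of the four original bytes is at or above 0x80, so each one expands to two bytes, giving the eight bytes seen in the question.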

The workaround was to change the code to:

  @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$page);
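For context, the workaround could be put together end to end roughly as follows. This is only a sketch: the function name `extractJsonLd` and the sample page are made up for illustration, and it swaps the `@` operator for `libxml_use_internal_errors()`, which is a different way of silencing parser warnings than the one used above.

```php
<?php
// Hypothetical helper: collect every decodable JSON-LD block from an HTML
// string that is known to be UTF-8 encoded.
function extractJsonLd(string $html): array
{
    // Collect libxml warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The XML declaration prefix hints to the parser that the input is UTF-8.
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $decoded = json_decode($script->nodeValue, true);
        if ($decoded !== null) {
            $result[] = $decoded; // skip blocks that are not valid JSON
        }
    }
    return $result;
}

// Made-up sample page with no charset declaration at all.
$page  = '<html><head><script type="application/ld+json">{"name":"ół"}</script></head><body></body></html>';
$items = extractJsonLd($page);

echo bin2hex($items[0]['name']), "\n";
```

If the prefix takes effect, the decoded name should still be the original four UTF-8 bytes rather than the doubled sequence.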

Thank you @ChrisHaas!

Pepe