I am trying to parse the Guardian RSS feed (Link). The feed contains curved quotes (” ’ “ ‘), dashes (-) and accented characters (Orbán).

When I parse the feed and display the text on an HTML page, these characters show up garbled in the 'description' section: â for quotes and dashes, Ã¡ for á, and so on. How do I make them parse properly?

Code

$xml = simplexml_load_file($link);
for ($i = 0; $i < 30; $i++) {
    $title = $xml->channel->item[$i]->title;
    $description = $xml->channel->item[$i]->description;
    $count = 0;
    $para = "";
    $doc = new DOMDocument();
    @$doc->loadHTML($description);
    while ($count < 3) {
        if ($count == 0) {
            $para = $doc->getElementsByTagName('p')->item($count)->nodeValue;
        } else {
            $para = $para . "<br><br>" . $doc->getElementsByTagName('p')->item($count)->nodeValue;
        }
        $count++;
    }
    echo "<tr>";
    echo "<td>" . $title . "</td>";
    echo "<td>" . $para . "</td>";
    echo "</tr>";
}

I have the following line in my 'head' section.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The title displays correctly. That might be because the feed uses straight quotes (') in the title and curved quotes (‘) in the description, but as you can see, á also displays correctly in the title.

user3884753

1 Answer

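The problem was with the loadHTML line. When the markup does not declare an encoding, DOMDocument::loadHTML() assumes ISO-8859-1 rather than UTF-8, so the multibyte characters in the description are misread.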

I replaced this line

@$doc->loadHTML($description);

with this line

@$doc->loadHTML('<?xml encoding="utf-8" ?>'.$description);

Check the original answer here.
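
For anyone who wants to try the fix in isolation, here is a minimal sketch. It assumes a hard-coded sample string in place of a real feed item, and the helper name firstParagraphs() is made up for illustration (it is not part of the original code). It also stops early when a description has fewer than three paragraphs, and uses libxml_use_internal_errors() instead of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical sample standing in for one item's description (this file is saved as UTF-8).
$description = '<p>Orbán said: “It’s fine”</p><p>Second paragraph</p>';

// Illustrative helper: returns the text of up to $limit <p> elements,
// joined with <br><br>, reading the input as UTF-8.
function firstParagraphs(string $html, int $limit = 3): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally rather than suppressing them with @.
    $previous = libxml_use_internal_errors(true);

    // The XML declaration prefix tells the HTML parser that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $parts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        if (count($parts) >= $limit) {
            break; // only the first $limit paragraphs are wanted
        }
        // Escape the text before it is echoed into the page.
        $parts[] = htmlspecialchars($p->nodeValue, ENT_QUOTES, 'UTF-8');
    }

    return implode('<br><br>', $parts);
}

echo firstParagraphs($description), "\n";
```

Removing the prefix from the loadHTML() call should bring the garbled characters back for the same input, which is a quick way to confirm that this line is the cause.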

user3884753