
We are trying to parse HTML like this:

<li><a class="newsMarquee" href="http://www.lebanonfiles.com/news/617843">مستخدمو &quot;كهرباء لبنان&quot;: الاضراب مستمر حتى إقرار موازنة 2013 الخاصة بنا</a></li>
<li><a class="newsMarquee" href="http://www.lebanonfiles.com/news/617840">اجتماع برئاسة محافظ الجنوب بحث في اوضاع النازحين</a></li>

We get this as the result:

ÃÑÚíÉ ÇááÌÇä ÃÑÓÊ ËáÇËÉ ãÔÇÑíÚ ÈíÆíÉ ãÓÊÎÃãæ "ßåÑÈÇà áÈäÇä": ÇáÇÖÑÇÈ ãÓÊãÑ ÃÊì ÅÞÑÇÑ ãæÇÒäÉ 2013 ÇáÎÇÕÉ 銂

We already call header("Content-Type: text/html; charset=utf-8"); in the script. Any suggestions?

This is the code:

<?php

// Send the header before any output. It only describes the page this
// script returns, not the document that DOMDocument reads.
header("Content-Type: text/html; charset=utf-8");

echo '<html><head></head><body>';

// $url is set elsewhere (not shown in the question).
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

// Go through every <div> and look for the sections with class "no-js".
foreach ($dom->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') !== 'no-js') {
        continue;
    }

    // Print the leading run of <a class="newsMarquee"> headlines and
    // stop at the first link that has a different class.
    foreach ($div->getElementsByTagName('a') as $link) {
        if ($link->getAttribute('class') !== 'newsMarquee') {
            break;
        }
        echo $link->nodeValue . '<br/>';
        //echo 'Link: ' . $link->getAttribute('href') . '<br/><br/>';
    }
}

echo '</body></html>';
?>
ThePunisher
  • Can you show the code you're using? – Pekka Oct 22 '13 at 12:31
  • Can you show the header of your XML file (especially the content type used)? – Jeroen Oct 22 '13 at 12:34
  • I have edited the question and added the code. – ThePunisher Oct 22 '13 at 12:37
  • possible duplicate of [PHP DomDocument failing to handle utf-8 characters (☆)](http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters) - If you have XML, why do you use loadHTML? – hakre Oct 22 '13 at 13:15
  • I edited the question. I am parsing HTML, not XML. – ThePunisher Oct 22 '13 at 13:47
  • `loadHTML` expects Latin-1 character encoding by default (because that is the HTML 4 default). The document you load declares its character encoding only in the HTTP headers, so `loadHTML` never sees it, falls back to Latin-1, and you get the output you posted. Instead, hint the encoding to `loadHTML` as outlined in this answer: http://stackoverflow.com/a/11310258/367456 – hakre Oct 22 '13 at 21:28
  • Your example is also incomplete and perhaps too long, and it would benefit from better formatting. Also, `$url` remains undefined. The best thing is to create a small example script that demonstrates your issue, then describe in your own words what you think the issue is and why the solutions suggested in related Q&A on this site have not worked for you so far. – hakre Oct 22 '13 at 21:32

2 Answers

1

Your source

http://www.lebanonfiles.com/news/617843

isn't using the UTF-8 character set; it's using Windows-1256 (Arabic)*.

Try using Windows-1256 as the second argument to your DOMDocument call:

$dom = new DOMDocument('1.0', 'Windows-1256');

* For future reference: I found this out by opening the URL in my browser and checking the "Encoding" menu, which shows the encoding the browser is using. You can also look in the "Net" tab of your browser's developer tools and see what Content-Type the page returns.
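
The diagnosis can also be checked from PHP without a browser. The sketch below is only an illustration with a stand-in word rather than the downloaded page: it encodes "لبنان" as Windows-1256, then decodes the same bytes once as Latin-1 and once as Windows-1256. It assumes the iconv extension is available and knows the name WINDOWS-1256. The misread result is the same fragment that appears in the garbled output quoted in the question.

```php
<?php
// Stand-in data: a known Arabic word ("Lebanon") as the site would serve it.
$original = 'لبنان';
$win1256  = iconv('UTF-8', 'WINDOWS-1256', $original);

// Decode the bytes as Latin-1, which is what a parser does when it
// assumes Latin-1 for a Windows-1256 document.
$misread = iconv('ISO-8859-1', 'UTF-8', $win1256);

// Decode the same bytes with the correct charset.
$correct = iconv('WINDOWS-1256', 'UTF-8', $win1256);

echo bin2hex($win1256), "\n"; // e1c8e4c7e4
echo $misread, "\n";          // áÈäÇä
echo $correct, "\n";          // لبنان
```

If the misread string matches what your script prints, the bytes are Windows-1256 and are being decoded with the wrong charset somewhere along the way.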

Pekka
  • It didn't work man. I tried $dom = new DOMDocument('1.0', 'Windows-1256'); and $dom = new DOMDocument('1.0', 'utf-8'); – ThePunisher Oct 22 '13 at 12:48
  • Hmm, maybe DOMDocument doesn't understand "Windows-1256" then. What I'd do is load the document using `file_get_contents()` instead, convert it to UTF-8 using `iconv("windows-1256", "utf-8", $content);`, and then load it using `loadHTML()` – Pekka Oct 22 '13 at 12:50
  • I will try that and tell you what happened. – ThePunisher Oct 22 '13 at 12:52
  • I tried what you said. Still no result. – ThePunisher Oct 22 '13 at 13:35
  • Read and understand the answer given in the duplicate question; it contains the solution. The assumption outlined in this answer is simply wrong: `new DOMDocument('1.0', 'Windows-1256');` is useless here, because it only changes what is put in the [XML declaration](http://xmlwriter.net/xml_guide/xml_declaration.shtml) when you output the DOMDocument as an XML document. This setting can also trigger some output encoding, but that is more or less transparent (you will see PHP warnings if there is a problem with an encoding). – hakre Oct 22 '13 at 21:30
  • @hakre argh, silly me, of course. Well, your answer says it all, going to delete this soon. Why does this have 4 reopen votes though? – Pekka Oct 22 '13 at 22:40
  • The only clue I have about the reopen votes is that it was closed On Hold and the OP then edited. Often this is enough to trigger voting reflexes in the review queue w/o others reviewing all close reasons and if the edit was any good etc. – hakre Oct 23 '13 at 14:39
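
Putting the comments together: converting with `iconv()` is only half of the job, because `loadHTML()` still has to be told, or no longer has to guess, what the bytes mean. The following is a minimal sketch and not the code from the linked answer. Instead of an encoding hint it swaps in a different technique: after converting to UTF-8 it turns every non-ASCII character into a numeric entity, so the parser's default charset stops mattering. The helper name `loadHtmlFromCharset` and the sample markup are made up for illustration, and the sketch assumes the iconv, mbstring, and DOM extensions are loaded.

```php
<?php
// Hypothetical helper (not from the thread): parse HTML bytes whose charset
// is known from elsewhere, e.g. from the HTTP response headers.
function loadHtmlFromCharset(string $html, string $charset): DOMDocument
{
    // Step 1: convert the raw bytes to UTF-8.
    $utf8 = iconv($charset, 'UTF-8', $html);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert input from $charset");
    }

    // Step 2: replace every non-ASCII character with a numeric entity so
    // the parser's own charset guess no longer matters.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Stand-in data: Windows-1256 bytes built locally instead of being fetched
// with file_get_contents($url).
$headline = 'اجتماع برئاسة محافظ الجنوب';
$bytes = iconv(
    'UTF-8',
    'WINDOWS-1256',
    '<li><a class="newsMarquee" href="#">' . $headline . '</a></li>'
);

$dom = loadHtmlFromCharset($bytes, 'WINDOWS-1256');
foreach ($dom->getElementsByTagName('a') as $link) {
    if ($link->getAttribute('class') === 'newsMarquee') {
        echo $link->nodeValue, "\n"; // UTF-8 text, ready for a UTF-8 page
    }
}
```

With the real page you would pass the result of `file_get_contents($url)` as the first argument. `nodeValue` then returns UTF-8, which matches the `charset=utf-8` header the script already sends.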
1

Check the encoding from the source as Pekka says.

The line

header("Content-Type: text/html; charset=utf-8");

has no effect on how the remote file is read. It only sets the content type of the page your own script outputs.
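
The encoding of the page you fetch is announced in that page's own response headers, not in yours. As a rough sketch, assuming you already have the remote Content-Type value as a string (for example from the `$http_response_header` array that PHP fills after `file_get_contents()` on an HTTP URL), a small helper can pull the charset out of it. The function name `charsetFromContentType` and the sample header values are hypothetical.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, or return null when none is declared.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Assumed example values; the real ones come from the remote server.
var_dump(charsetFromContentType('text/html; charset=windows-1256'));
var_dump(charsetFromContentType('text/html'));
```

The returned name can then be used as the source charset when converting the downloaded bytes to UTF-8.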