Parsing HTML with Arabic characters yields strange results like "ÃÃ‘Ãš"

Question

We are trying to parse HTML like this:

<li><a class="newsMarquee" href="http://www.lebanonfiles.com/news/617843">مستخدمو &quot;كهرباء لبنان&quot;: الاضراب مستمر حتى إقرار موازنة 2013 الخاصة بنا</a></li>
<li><a class="newsMarquee" href="http://www.lebanonfiles.com/news/617840">اجتماع برئاسة محافظ الجنوب بحث في اوضاع النازحين</a></li>

We are getting this as result:

ÃÃ‘ÃšÃÃ‰ Ã‡Ã¡Ã¡ÃŒÃ‡Ã¤ ÃÃ‘Ã“ÃŠ Ã‹Ã¡Ã‡Ã‹Ã‰ Ã£Ã”Ã‡Ã‘ÃÃš ÃˆÃÃ†ÃÃ‰ Ã£Ã“ÃŠÃŽÃÃ£Ã¦ "ÃŸÃ¥Ã‘ÃˆÃ‡Ã Ã¡ÃˆÃ¤Ã‡Ã¤": Ã‡Ã¡Ã‡Ã–Ã‘Ã‡Ãˆ Ã£Ã“ÃŠÃ£Ã‘ ÃÃŠÃ¬ Ã…ÃžÃ‘Ã‡Ã‘ Ã£Ã¦Ã‡Ã’Ã¤Ã‰ 2013 Ã‡Ã¡ÃŽÃ‡Ã•Ã‰ ÃˆÃ¤Ã‡

We have already set header("Content-Type: text/html; charset=utf-8"); but it makes no difference. Any suggestions?

This is the code:

<?php
// header() must be called before any output is sent.
header("Content-Type: text/html; charset=utf-8");

echo '<html><head></head>';
echo '<body>';

// Note: $url is not defined in the posted snippet.
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

$params = $dom->getElementsByTagName('div'); // find sections

foreach ($params as $param) { // go through each section one by one
    if ($param->getAttribute('class') == 'no-js') {
        // print the text of each leading news link in this section
        foreach ($param->getElementsByTagName('a') as $link) {
            if ($link->getAttribute('class') != 'newsMarquee') {
                break; // stop at the first link that is not a news item
            }
            echo $link->nodeValue . '<br/>';
            //echo 'Link: ' . $link->getAttribute('href') . '<br/><br/>';
        }
    }
}

echo '</body>';
echo '</html>';
?>

Can you show the header of your XML file (especially the content type used)? – Jeroen, Oct 22 '13 at 12:34
Possible duplicate of [PHP DomDocument failing to handle utf-8 characters (☆)](http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters). If you have XML, why do you use loadHTML? – hakre, Oct 22 '13 at 13:15
`loadHTML` expects Latin-1 character encoding by default (because that is the HTML 4 default). The document you load declares its character encoding only in the HTTP headers, which `loadHTML` never sees, so it falls back to Latin-1 and you get the output shown. Instead, hint your encoding to `loadHTML` as outlined in this answer: http://stackoverflow.com/a/11310258/367456 – hakre, Oct 22 '13 at 21:28
Also, your example is incomplete and perhaps too long, and it would benefit from proper formatting. `$url` is also undefined. The best approach is to create a small example script that demonstrates the issue, then describe in your own words what you think the issue is and why the solutions suggested in related Q&A on this site have not worked for you so far. – hakre, Oct 22 '13 at 21:32
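
A minimal sketch of the encoding hint described in the comment above, assuming the source bytes are Windows-1256. The helper name `loadHtmlWithCharset` is hypothetical, and prefixing a `<meta>` Content-Type tag is just one common way to pass the hint, not necessarily the exact one used in the linked answer. The sample HTML is built in memory so the snippet runs without a network request.

```php
<?php
/**
 * Hypothetical helper (not from the thread): converts raw HTML bytes to UTF-8
 * and hints that encoding to DOMDocument::loadHTML().
 * The default source charset is an assumption.
 */
function loadHtmlWithCharset(string $rawHtml, string $sourceCharset = 'WINDOWS-1256'): DOMDocument
{
    // Convert the raw bytes to UTF-8 first; iconv() returns false on failure.
    $utf8 = iconv($sourceCharset, 'UTF-8', $rawHtml);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from {$sourceCharset} to UTF-8");
    }

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The meta tag placed first tells the HTML parser that the bytes are UTF-8.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($hint . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Usage with an in-memory sample instead of a live download.
$sample = '<ul><li><a class="newsMarquee" href="#">اجتماع برئاسة محافظ الجنوب</a></li></ul>';
$raw = iconv('UTF-8', 'WINDOWS-1256', $sample); // simulate the bytes the site would send
$dom = loadHtmlWithCharset($raw);
echo $dom->getElementsByTagName('a')->item(0)->nodeValue, PHP_EOL;
```

For a real page, the raw bytes would come from something like `file_get_contents($url)`, and the source charset should be confirmed rather than assumed.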

score 1 · Answer 1 · answered Oct 22 '13 at 12:45

Your source

http://www.lebanonfiles.com/news/617843

isn't using the UTF-8 character set; it's using Windows-1256 (Arabic)*.

Try using Windows-1256 as the second argument to your DOMDocument call:

$dom = new DOMDocument('1.0', 'Windows-1256');

* For future reference: I found this out by opening the URL in my browser and checking the "Encoding" menu, which shows the encoding the browser uses. You can also look in the "Net" tab of your browser's developer tools to see which Content-Type the page returns.
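
The same check can be done in code rather than in the browser. The sketch below is an illustration only: `detectCharset` is a hypothetical helper that looks for a charset first in the HTTP response header lines and then in a `<meta>` tag near the top of the document. It is a simple pattern match, not a full implementation of browser encoding sniffing.

```php
<?php
/**
 * Hypothetical helper: returns the charset named in a Content-Type header
 * line or, failing that, in a <meta> tag within the first 2048 bytes.
 * Returns null when no charset is declared.
 */
function detectCharset(array $headerLines, string $html): ?string
{
    // 1. HTTP headers, e.g. "Content-Type: text/html; charset=windows-1256"
    foreach ($headerLines as $line) {
        if (preg_match('/^content-type:.*?charset=["\']?([\w.:-]+)/i', $line, $m)) {
            return strtolower($m[1]);
        }
    }

    // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">
    $head = substr($html, 0, 2048);
    if (preg_match('/<meta[^>]+charset=["\']?([\w.:-]+)/i', $head, $m)) {
        return strtolower($m[1]);
    }

    return null;
}

// Usage with sample data; real header lines would come from the HTTP response.
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=windows-1256'];
echo detectCharset($headers, '<html></html>'), PHP_EOL;
```

The header is checked first because, as in this thread, a page may declare its encoding only in the HTTP response and not in the markup.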

Pekka

It didn't work man. I tried $dom = new DOMDocument('1.0', 'Windows-1256'); and $dom = new DOMDocument('1.0', 'utf-8'); – ThePunisher Oct 22 '13 at 12:48
Hmm, maybe DOMDocument doesn't understand "Windows-1256" then. What I'd do is load the document using `file_get_contents()` instead, convert it to UTF-8 using `iconv("windows-1256", "utf-8", $content);`, and then load it using `loadHTML()` – Pekka Oct 22 '13 at 12:50
I will try that and tell you what happened – ThePunisher Oct 22 '13 at 12:52
I tried what you said. Still no result – ThePunisher Oct 22 '13 at 13:35
Read the answer given in the duplicate question; it contains the solution. The assumption outlined in this answer is wrong: `new DOMDocument('1.0', 'Windows-1256');` is useless here. It only changes what is put in the [XML declaration](http://xmlwriter.net/xml_guide/xml_declaration.shtml) when you output the DOMDocument as an XML document. This setting can also trigger some output encoding, but that is more or less transparent (you will see PHP warnings if there is a problem with the encoding). – hakre Oct 22 '13 at 21:30
@hakre argh, silly me, of course. Well, your answer says it all, going to delete this soon. Why does this have 4 reopen votes though? – Pekka Oct 22 '13 at 22:40
The only clue I have about the reopen votes is that it was closed On Hold and the OP then edited. Often this is enough to trigger voting reflexes in the review queue w/o others reviewing all close reasons and if the edit was any good etc. – hakre Oct 23 '13 at 14:39
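
To illustrate the point made in the comments that the constructor's encoding argument does not influence parsing, here is a small sketch. `firstLinkText` is a hypothetical helper; it parses the same bytes twice with different constructor arguments and compares the results. No encoding hint is given to `loadHTML`, so the extracted text itself may still be garbled; the only claim is that both runs should produce the same text.

```php
<?php
/**
 * Hypothetical helper: parses $html with a DOMDocument constructed with
 * $declaredEncoding and returns the text of the first <a> element.
 */
function firstLinkText(string $html, string $declaredEncoding): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument('1.0', $declaredEncoding);
    $dom->loadHTML($html); // loading replaces the document created by the constructor
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $link = $dom->getElementsByTagName('a')->item(0);
    return $link === null ? '' : $link->nodeValue;
}

// Sample bytes in Windows-1256, built in memory for the demonstration.
$bytes = iconv('UTF-8', 'WINDOWS-1256', '<p><a href="#">اجتماع</a></p>');

$declaredArabic = firstLinkText($bytes, 'Windows-1256');
$declaredUtf8 = firstLinkText($bytes, 'UTF-8');

// Reports whether the two parses produced identical text.
var_dump($declaredArabic === $declaredUtf8);
```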

score 1 · Answer 2 · answered Oct 22 '13 at 12:57

Check the encoding of the source, as Pekka says.

The line

header("Content-Type: text/html; charset=utf-8");

has no effect on how the remote file is read. It only declares the encoding of your own page's output.
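
A short sketch of the output side on its own, assuming the headlines have already been decoded to UTF-8 elsewhere. `renderHeadlines` is a hypothetical helper; the point is that `header()` only labels what this script sends and has to run before any output.

```php
<?php
/**
 * Hypothetical helper: builds the HTML for a list of UTF-8 headlines,
 * escaping each one so it is safe to print.
 */
function renderHeadlines(array $headlines): string
{
    $html = '';
    foreach ($headlines as $headline) {
        $html .= htmlspecialchars($headline, ENT_QUOTES, 'UTF-8') . "<br/>\n";
    }
    return $html;
}

// header() only describes this script's response; call it before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo '<html><head><meta charset="utf-8"></head><body>';
echo renderHeadlines(['اجتماع برئاسة محافظ الجنوب']);
echo '</body></html>';
```

If the text was decoded incorrectly on the way in, this header cannot repair it; it only tells the browser how to interpret the bytes the script emits.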

Kevin Marie - Eode9
