PHP DOMDocument Japanese Character encoding issue

Question

I have a file called: ニューヨーク・ヤンキース-チケット-200x225.jpg

I am able to successfully do this with my PHP code:

if (file_exists(ABSPATH . 'ニューヨーク・ヤンキース-チケット-200x225.jpg')) {
    echo 'yes';
}

However, when I parse my content using DOMDocument, that same string is returned as: ãã¥ã¼ã¨ã¼ã¯ã»ã¤ã³ãã¼ã¹-ãã±ãã-200x225.jpg

How do I prevent this from happening with the following code? Our application is internationalised, so we need to accommodate all UTF-8 characters:

$dom = new DOMDocument();
$dom->encoding = 'utf-8';
$dom->loadHTML($content);
$images = $dom->getElementsByTagName('img');

foreach ($images as $image) {
    if( $image->hasAttribute('srcset') ) continue;
    echo $initImgSrc = $image->getAttribute('src');
    if (!preg_match('/[_-]\d+x\d+(?=\.[a-z]{3,4}$)/', $initImgSrc)) continue;

    $newImgSrc = preg_replace('/[_-]\d+x\d+(?=\.[a-z]{3,4}$)/', '', $initImgSrc);
    if (strpos($newImgSrc, '/') === 0) {
        $newImgPath = str_replace( '/wp-content', ABSPATH . 'wp-content', $newImgSrc);
    } else {
        $newImgPath = str_replace( get_home_url(), ABSPATH, $newImgSrc);
    }
    if (!file_exists($newImgPath)) continue;
    echo 'yes';
    $dom->saveXML($image);

    $oldSrc = 'src="' . $initImgSrc . '"';
    $newDataSrcSet = $initImgSrc . ' 1x, ' . $newImgSrc . ' 2x';
    $newSrcWithSrcSet = $oldSrc . ' srcset="' . $newDataSrcSet .'"';
    $content  = str_replace( $oldSrc, $newSrcWithSrcSet, $content );
}
return $content;

This code works normally, just not with the Japanese characters. Any help would be immensely appreciated.

score 1 · Accepted Answer · answered Aug 30 '19 at 15:08

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
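The misreading described above can be reproduced without DOMDocument, which may help explain the garbled string in the question. The following is only a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative and not taken from the question.

```php
<?php
// Illustration only: deliberately misread UTF-8 bytes as ISO-8859-1,
// mimicking the misinterpretation the answer describes.
$original = 'ニューヨーク';

// Treat each byte of the UTF-8 string as one ISO-8859-1 character and
// re-encode the result as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Every original byte has become a separate character.
$byteCount = strlen($original);             // 6 characters, 3 bytes each
$charCount = mb_strlen($garbled, 'UTF-8');  // one character per original byte

// Reversing the mistaken conversion recovers the original text.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo $garbled, PHP_EOL;
echo $restored, PHP_EOL;
```

Each katakana character occupies three bytes in UTF-8, so it turns into an "ã" followed by two further characters, some of which are invisible control characters. That matches the pattern of the garbled file name shown in the question.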

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
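
Applied to markup like that in the question, the same prefix should make getAttribute return the file name intact. This is a sketch under assumptions: the $content value and the /wp-content/uploads/ path are made-up stand-ins for the real post content, and the behaviour may depend on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical stand-in for the $content being filtered in the question.
$content = '<p><img src="/wp-content/uploads/ニューヨーク・ヤンキース-チケット-200x225.jpg"></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Prepend the encoding declaration so the parser reads the bytes as UTF-8.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $content);

// Read the attribute back from the first <img> element.
$src = $dom->getElementsByTagName('img')->item(0)->getAttribute('src');

echo $src, PHP_EOL;
```

Because the attribute value now matches the bytes in the original $content, later string operations on it (such as the str_replace call in the question) can find the source string again.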

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
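
On PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding raises a deprecation notice. One commonly suggested substitute, which is not part of the original answer, is mb_encode_numericentity. The sketch below assumes the mbstring extension and uses a made-up $content string; it turns every non-ASCII character into a numeric entity, so the parser only ever sees ASCII input.

```php
<?php
// Hypothetical stand-in for the content being parsed.
$content = '<p><img src="/uploads/チケット-200x225.jpg"></p>';

// Conversion map: encode code points 0x80..0x10FFFF, no offset, full mask.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiOnly = mb_encode_numericentity($content, $convmap, 'UTF-8');

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($asciiOnly);

// The parser decodes the entities, so the DOM holds the real characters.
$src = $dom->getElementsByTagName('img')->item(0)->getAttribute('src');

echo $src, PHP_EOL;
```

Like the HTML-ENTITIES conversion, this sidesteps the parser's encoding guess entirely, at the cost of a larger intermediate string.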

Answer copied from here: PHP DOMDocument loadHTML not encoding UTF-8 correctly

If you think the answer to this problem is basically a matter of looking at another question (and answer), then you should say it is a duplicate in the comments section. The other question may have other answers that also help in this situation, so the question should be linked rather than having its answers duplicated. — Nigel Ren, Aug 30 '19 at 15:19
@rkg Thanks, the first answer was correct: worked immediately. I would've thought $dom->encoding would do the same thing, but I guess that's to encode before a save rather than a declaration. Thanks all! — James Cartwright, Sep 02 '19 at 07:09
@NigelRen Thanks for the tips. I happened to be recently active here and was just trying to help. I'll do better next time. — rkg, Sep 02 '19 at 08:01
