I have a method that takes an HTML string, loops through each HTML tag, and adds the text content to an associative array, which is then passed to json_encode() and written to a JSON file.

For some reason, the JSON file I create contains garbled characters, as you can see in the screenshot.

screenshot

Storage::disk('public')->put($fileName, json_encode($newArray, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT));

My full method:

        $htmlString = '<section>
        <h2>CCPA Privacy Notice Addendum</h2>
        <p>This California Consumer Privacy Act (CCPA) Privacy Notice Addendum supplements the information provided in the [App Name] Privacy Policy and applies solely to residents of the State of California ("consumers" or "you"). We adopt this addendum to comply with the CCPA and provide you with the required information about your rights under the CCPA.</p>
    </section>';
        
        $dom = new \DOMDocument();
        libxml_use_internal_errors(true);
        $dom->loadHTML($htmlString);

        $count = 0;
        $keyPattern = 'ccpaRights';
        $newArray = [];

        foreach ($dom->getElementsByTagName('section') as $section)
        {

            // loop through each child of <section>
            foreach ($section->childNodes as $childNode)
            { 
                $nodeValue = $childNode->nodeValue; 

                if ($nodeValue === '' )
                {
                    continue; 
                }

                $count = $count + 1;
                $key = (string) $keyPattern.'Text'.$count;

                $newArray[$key] = $nodeValue;
            }
        }


        $fileName = '/temp/translated-'.rand(1,1000).'.json';

        Storage::disk('public')->put($fileName, json_encode($newArray, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT));
Moshe Katz
gabogabans
  • 1
    Please copy the text into the question, not just a screenshot. Include both versions of the text - what you expect to see and also what's in the output file. – Moshe Katz May 08 '23 at 00:00
  • 1
    What you're using to _view_ that data is not treating it as UTF8, it's treating it as either ISO8859-1 or cp1252. Change your terminal and/or editor settings. This is also why high-order unicode codepoints are escaped by default, so they can survive encoding mismatches in transit. – Sammitch May 08 '23 at 00:36
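The escaping behaviour mentioned in the comment above can be seen in a minimal sketch like the following. The sample value is hypothetical, and the snippet assumes the PHP source file itself is saved as UTF-8.

```php
<?php
// Hypothetical sample value containing curly quotes (U+201C and U+201D).
$data = ['text' => '“you”'];

// Default behaviour: non-ASCII code points are written as \uXXXX escapes,
// so the output is pure ASCII and survives an encoding mismatch.
$escaped = json_encode($data);

// With JSON_UNESCAPED_UNICODE the raw UTF-8 bytes are written instead,
// so whatever reads the file must also treat it as UTF-8.
$raw = json_encode($data, JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL; // {"text":"\u201cyou\u201d"}
echo $raw, PHP_EOL;     // {"text":"“you”"}
```

Both strings decode to the same data; only the on-disk representation differs.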

2 Answers


DOMDocument's loadHTML() method treats the markup as ISO-8859-1 by default unless a character encoding is explicitly declared.

That said, the link I provided uses an out-of-date method to fix this. The functions used have been deprecated as of PHP 8.2 and may be removed in 8.3+.

The 8.3-compatible alternative used by most frameworks is:

htmlspecialchars_decode(
    iconv(
        'UTF-8',
        'ISO-8859-1',
        htmlentities($htmlString, ENT_COMPAT, 'UTF-8')
    ),
    ENT_QUOTES
)

Adding that to your code:

        $htmlString = '<section>
        <h2>CCPA Privacy Notice Addendum</h2>
        <p>This California Consumer Privacy Act (CCPA) Privacy Notice Addendum supplements the information provided in the [App Name] Privacy Policy and applies solely to residents of the State of California ("consumers" or "you"). We adopt this addendum to comply with the CCPA and provide you with the required information about your rights under the CCPA.</p>
    </section>';
        
        $dom = new \DOMDocument();
        libxml_use_internal_errors(true);
        $dom->loadHTML(htmlspecialchars_decode(iconv('UTF-8', 'ISO-8859-1', htmlentities($htmlString, ENT_COMPAT, 'UTF-8')), ENT_QUOTES));

        $count = 0;
        $keyPattern = 'ccpaRights';
        $newArray = [];

        foreach ($dom->getElementsByTagName('section') as $section)
        {

            // loop through each child of <section>
            foreach ($section->childNodes as $childNode)
            { 
                $nodeValue = $childNode->nodeValue; 

                if ($nodeValue === '' )
                {
                    continue; 
                }

                $count = $count + 1;
                $key = (string) $keyPattern.'Text'.$count;

                $newArray[$key] = $nodeValue;
            }
        }


        $fileName = '/temp/translated-'.rand(1,1000).'.json';

        Storage::disk('public')->put($fileName, json_encode($newArray, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT));

This should output a properly encoded JSON file.
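To see what the nested expression in this answer does, here is a small step-by-step sketch. The sample markup and variable names are illustrative, not part of the original code, and the snippet assumes the iconv extension is available and the source file is saved as UTF-8.

```php
<?php
// Hypothetical sample markup containing curly quotes.
$html = '<p>“consumers” or “you”</p>';

// Step 1: replace every character that has a named entity with that entity.
// This also turns the markup characters < and > into &lt; and &gt;.
$entities = htmlentities($html, ENT_COMPAT, 'UTF-8');

// Step 2: convert to ISO-8859-1. The string is plain ASCII at this point,
// so for this input the bytes do not change.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $entities);

// Step 3: restore only the markup characters (< > & and quotes),
// leaving entities such as &ldquo; untouched.
$safe = htmlspecialchars_decode($latin1, ENT_QUOTES);

echo $safe, PHP_EOL; // <p>&ldquo;consumers&rdquo; or &ldquo;you&rdquo;</p>

// The parser resolves the entities itself, so the node text comes back as UTF-8.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($safe);
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;
```

One caveat, as far as I can tell: htmlentities() leaves characters without a named entity in its table (emoji, for example) as raw UTF-8, so the iconv() step may fail on such input.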

Jim

The loadHTML() method does not treat the HTML string as UTF-8 by default. One simple way around this is to replace this line:

$dom->loadHTML($htmlString);

With this:

 $dom->loadHTML('<!DOCTYPE html><meta charset="UTF-8">' . $htmlString);
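A minimal round trip using this prefix might look like the sketch below. The sample markup is hypothetical, and the approach assumes the underlying libxml2 build recognizes the `<meta charset>` declaration (reasonably recent versions should) and that the source file is saved as UTF-8.

```php
<?php
// Hypothetical sample markup containing curly quotes.
$html = '<section><p>“consumers” or “you”</p></section>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them

// Declare the encoding up front so the parser reads the bytes as UTF-8.
$dom->loadHTML('<!DOCTYPE html><meta charset="UTF-8">' . $html);
libxml_clear_errors();

$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;

// Same flags as in the question: keep non-ASCII characters unescaped.
$json = json_encode(
    ['ccpaRightsText1' => $text],
    JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT
);

echo $json, PHP_EOL;
```

With the encoding declared, the extracted text should match the input byte for byte, so the JSON file contains the original characters.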
MElhalees