I have the following code saved as the content of a post in an unmodified WordPress install running the Twenty Nineteen theme. It contains a combination of HTML entities and hex entities.
<span>Explore the professions of some of the groundbreaking women in
science, technology, engineering and mathematics (STEM) with the LEGO® Ideas Women of NASA set. It features minifigures of 4 pioneering women of NASA— astronomer and educator Nancy Grace Roman, computer scientist and entrepreneur Margaret Hamilton, astronaut, physicist and entrepreneur Sally Ride and astronaut, physician and engineer Mae Jemison—and 3 builds illustrating their areas of expertise. Role-play space exploration from planning to moon landing, beginning with the iconic scene from Massachusetts Institute of Technology in 1969 of Hamilton with software that she and her team programmed. Build the posable Hubble Space Telescope and launch a LEGO version of the Space Shuttle Challenger with 3 removable rocket stages. The set also includes a booklet about the 4 featured women of NASA, and the fan creator and LEGO designers of this fun and educational set.</span><div class="ProductFeatures__BulletText-s6my6ry-2 UbHkR"><span><ul><li> Includes 4 minifigures: Nancy Grace Roman, Margaret Hamilton, Sally Ride and Mae Jemison.
</li><li> Features 3 LEGO® builds illustrating the areas of expertise of the 4 featured women of NASA.
</li><li> Nancy Grace Roman’s build features a posable Hubble Space Telescope with authentic details and a projected image of a planetary nebula.
</li><li> Margaret Hamilton's build features a stack of book elements, representing the books of listings of Apollo Guidance Computer (AGC) onboard flight software source code.
</li><li> Sally Ride and Mae Jemison’s build features a launchpad and Space Shuttle Challenger with 3 removable rocket stages.
</li><li> Also includes printed nameplates for each of the 4 women featured in this set.
</li><li> Great for role-playing space exploration missions.
</li><li> Includes a booklet with building instructions, plus information about the 4 featured women of NASA, the set's fan creator and the LEGO designers.
</li><li> Nancy Grace Roman's build measures over 2" (7cm) high, 3†(9cm) wide and 2†(6cm) deep.
</li><li> Margaret Hamilton’s build measures over 2" (6cm) high, 3†(8cm) wide and 1†(4cm) deep.
</li><li> Sally Ride and Mae Jemison’s build measures over 4" (12cm) high, 3†(10cm) wide and 2†(6cm) deep.</li></ul></span></div>
When the content is displayed on the front end of the website, the entities do not display correctly. It looks like some form of UTF-8 encoding issue?
I looked at this UTF-8 debug tool website and found that the erroneous characters match up to the characters I expected in the left part of the table.
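To make the pattern from that table concrete, here is a minimal sketch in Python (used only as a neutral illustration, not the PHP from my site). It assumes the garbling comes from UTF-8 bytes being read as Windows-1252, which I have not confirmed. The helper names are made up for this example.

```python
# Sketch: what happens when UTF-8 bytes are wrongly decoded as Windows-1252.
# Assumption: this is the kind of mismatch behind the garbled output.

def misread_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_misread(garbled: str) -> str:
    """Reverse the mistake: recover the original bytes, decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


samples = {
    "registered sign": "\u00ae",
    "em dash": "\u2014",
    "right single quote": "\u2019",
}

for name, char in samples.items():
    garbled = misread_as_cp1252(char)
    # Show the original, the garbled form, and the round-tripped result.
    print(f"{name}: {char!r} -> {garbled!r} -> {undo_misread(garbled)!r}")
```

Each single character turns into two or three characters, which is the same shape as the errors I see on the front end.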
I have tried:
- Applying utf8_decode() to post_content
- Applying htmlentities() to post_content
- Removing the wptexturize filter from post_content
None of these have worked for me. I think the issue is more complicated, possibly something to do with the character encoding of the original text? It came via an XML response from a Windows-based system. Is this something to do with Windows character encoding?
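A possible reason utf8_decode() did not help, sketched in Python as a rough analogue: utf8_decode() converts to ISO-8859-1 and replaces anything outside that charset with "?". If the text has already been garbled in the Windows-1252 way, it contains characters such as the euro sign that ISO-8859-1 cannot represent. The function name below is hypothetical, and the garbled input is an assumption about what my content looks like internally.

```python
# Sketch: rough Python analogue of PHP's utf8_decode() applied to garbled text.

def latin1_with_question_marks(text: str) -> bytes:
    """Convert to ISO-8859-1, replacing unrepresentable characters with '?'."""
    return text.encode("latin-1", errors="replace")


# An em dash after its UTF-8 bytes were misread as Windows-1252 (assumed input).
garbled_em_dash = "\u00e2\u20ac\u201d"

# Only the first character fits in ISO-8859-1; the other two are lost.
print(latin1_with_question_marks(garbled_em_dash))
```

Under that assumption the original bytes can no longer be recovered after the conversion, which would match the fact that the output got no better.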
Update
The comments on this question suggest that the issue is occurring before WordPress gets hold of it. I am requesting the content in XML from a Windows-based API via cURL in PHP and then using simplexml_load_string to parse it. If I dump the XML response directly to a file, the encoding is correct. See below.
<productSpecificationDetail><![CDATA[<span>Explore the professions of some of the groundbreaking women in science, technology, engineering and mathematics (STEM) with the LEGO® Ideas Women of NASA set. It features minifigures of 4 pioneering women of NASA— astronomer and educator Nancy Grace Roman, computer scientist and entrepreneur Margaret Hamilton, astronaut, physicist and entrepreneur Sally Ride and astronaut, physician and engineer Mae Jemison—and 3 builds illustrating their areas of expertise. Role-play space exploration from planning to moon landing, beginning with the iconic scene from Massachusetts Institute of Technology in 1969 of Hamilton with software that she and her team programmed. Build the posable Hubble Space Telescope and launch a LEGO version of the Space Shuttle Challenger with 3 removable rocket stages. The set also includes a booklet about the 4 featured women of NASA, and the fan creator and LEGO designers of this fun and educational set.</span>
At which point in the process between cURL and SimpleXML could the encoding be affected in this way? I can't see that I'm specifying a particular charset anywhere during the retrieval of the content.
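For illustration, one way this kind of damage can arise at the parsing step is an XML declaration that names a different encoding than the bytes actually use. The sketch below uses Python's ElementTree rather than SimpleXML, and the `detail` element and sample text are invented for the example. I do not know whether this is what happens in my case.

```python
# Sketch: same UTF-8 bytes parsed under two different declared encodings.
import xml.etree.ElementTree as ET

# Hypothetical payload: UTF-8 bytes with a CDATA section, like the API response.
payload = "<detail><![CDATA[<span>LEGO\u00ae Ideas</span>]]></detail>".encode("utf-8")


def parse_with_declared_encoding(body: bytes, declared: str) -> str:
    """Prepend an XML declaration naming `declared`, then parse the bytes."""
    prolog = f'<?xml version="1.0" encoding="{declared}"?>'.encode("ascii")
    return ET.fromstring(prolog + body).text


# Declaration matches the bytes: the registered sign survives.
print(parse_with_declared_encoding(payload, "UTF-8"))

# Declaration does not match: each UTF-8 byte becomes its own character.
print(parse_with_declared_encoding(payload, "ISO-8859-1"))
```

If something similar applies here, the declared encoding in the response (or the HTTP headers) would be the first thing to compare against the actual bytes.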