
I have the following code saved as the content of a post in an unmodified WordPress install running the Twenty Nineteen theme. It contains a mix of named HTML entities and numeric character references.

<span>Explore the professions of some of the groundbreaking women in 
science, technology, engineering and mathematics (STEM) with the LEGO&Acirc;&reg; Ideas Women of NASA set. It features minifigures of 4 pioneering women of NASA&acirc;&#8364;&#8221; astronomer and educator Nancy Grace Roman, computer scientist and entrepreneur Margaret Hamilton, astronaut, physicist and entrepreneur Sally Ride and astronaut, physician and engineer Mae Jemison&acirc;&#8364;&#8221;and 3 builds illustrating their areas of expertise. Role-play space exploration from planning to moon landing, beginning with the iconic scene from Massachusetts Institute of Technology in 1969 of Hamilton with software that she and her team programmed. Build the posable Hubble Space Telescope and launch a LEGO version of the Space Shuttle Challenger with 3 removable rocket stages. The set also includes a booklet about the 4 featured women of NASA, and the fan creator and LEGO designers of this fun and educational set.</span><div class="ProductFeatures__BulletText-s6my6ry-2 UbHkR"><span><ul><li> Includes 4 minifigures: Nancy Grace Roman, Margaret Hamilton, Sally Ride and Mae Jemison.
</li><li> Features 3 LEGO&Acirc;&reg; builds illustrating the areas of expertise of the 4 featured women of NASA.
</li><li> Nancy Grace Roman&acirc;&#8364;&#8482;s build features a posable Hubble Space Telescope with authentic details and a projected image of a planetary nebula.
</li><li> Margaret Hamilton's build features a stack of book elements, representing the books of listings of Apollo Guidance Computer (AGC) onboard flight software source code.
</li><li> Sally Ride and Mae Jemison&acirc;&#8364;&#8482;s build features a launchpad and Space Shuttle Challenger with 3 removable rocket stages.
</li><li> Also includes printed nameplates for each of the 4 women featured in this set.
</li><li> Great for role-playing space exploration missions.
</li><li> Includes a booklet with building instructions, plus information about the 4 featured women of NASA, the set's fan creator and the LEGO designers.
</li><li> Nancy Grace Roman's build measures over 2" (7cm) high, 3&acirc;&#8364; (9cm) wide and 2&acirc;&#8364; (6cm) deep.
</li><li> Margaret Hamilton&acirc;&#8364;&#8482;s build measures over 2" (6cm) high, 3&acirc;&#8364; (8cm) wide and 1&acirc;&#8364; (4cm) deep.
</li><li> Sally Ride and Mae Jemison&acirc;&#8364;&#8482;s build measures over 4" (12cm) high, 3&acirc;&#8364; (10cm) wide and 2&acirc;&#8364; (6cm) deep.</li></ul></span></div>

When the content is displayed on the front end of the website, the entities do not display correctly. It looks like some form of UTF-8 encoding issue?


I looked at this UTF-8 debug tool website and found that the erroneous characters match up to the characters I expected in the left part of the table.
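
That match can be reproduced in a few lines. The sketch below uses Python rather than PHP, purely for illustration, and the helper name `misread_utf8_as_cp1252` is made up for this example. It assumes the damage comes from UTF-8 bytes being decoded as Windows-1252, and checks the result against entity strings copied from the post content above.

```python
import html


def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# Intended character -> entity sequence that appears in the stored post content.
seen_in_post = {
    "®": "&Acirc;&reg;",
    "—": "&acirc;&#8364;&#8221;",
    "’": "&acirc;&#8364;&#8482;",
}

for intended, entities in seen_in_post.items():
    garbled = misread_utf8_as_cp1252(intended)
    # html.unescape turns the stored entities back into literal characters,
    # so the two strings can be compared directly.
    print(intended, "->", garbled, "| matches post:", garbled == html.unescape(entities))
```

If every line reports a match, the stored entities are consistent with a UTF-8 string that was read as Windows-1252 and then entity-encoded. It does not say which step in the pipeline did it.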


I have tried:

  • Applying utf8_decode() to post_content
  • Applying htmlentities() to post_content
  • Removing the wptexturize filter from post_content

None of these have worked for me. I think the issue is more complicated, possibly something to do with the character encoding of the original text? It came via an XML response from a Windows based system. Is this something to do with Windows character encoding?

Update

The comments on this question suggest that the issue is occurring before WordPress gets hold of it. I am requesting the content in XML from a Windows based API via cURL in PHP and then using simplexml_load_string to parse it. If I dump the XML response directly to a file, the encoding is correct. See below.

<productSpecificationDetail><![CDATA[<span>Explore the professions of some of the groundbreaking women in science, technology, engineering and mathematics (STEM) with the LEGO® Ideas Women of NASA set. It features minifigures of 4 pioneering women of NASA— astronomer and educator Nancy Grace Roman, computer scientist and entrepreneur Margaret Hamilton, astronaut, physicist and entrepreneur Sally Ride and astronaut, physician and engineer Mae Jemison—and 3 builds illustrating their areas of expertise. Role-play space exploration from planning to moon landing, beginning with the iconic scene from Massachusetts Institute of Technology in 1969 of Hamilton with software that she and her team programmed. Build the posable Hubble Space Telescope and launch a LEGO version of the Space Shuttle Challenger with 3 removable rocket stages. The set also includes a booklet about the 4 featured women of NASA, and the fan creator and LEGO designers of this fun and educational set.</span>

At which point in the process between cURL and SimpleXML could the encoding be affected in this way? I can't see that I'm specifying a particular charset anywhere while retrieving the content.

Robbie Lewis
  • That *looks* like UTF-8 being displayed in 8859-1, e.g. there are additional bits per character so you get *more* than expected. Windows uses a subset of 8859-1 (Windows 1252) - or at least it used to... so that's unlikely to be the issue. It *might* actually be something simpler, like the WP theme is setting the charset of the HTML to 8859-1. – CD001 Jan 22 '19 at 16:02
  • This might be a useful resource https://stackoverflow.com/a/3521340/7077417 – dan webb Jan 22 '19 at 16:14
  • Thanks @CD001, here is the resulting page on the Wordpress install: http://plugins.stodev.co.uk/test/ - the document charset is UTF-8 – Robbie Lewis Jan 22 '19 at 16:16
  • Hmmm... actually it looks like it's happening *earlier* than the HTML output - you've got data like `LEGO&Acirc;&reg;` which implies a string that's been juggled from UTF-8 to 8859-1 and then encoded *before* being sent to the client ... so it would be something in the PHP. – CD001 Jan 22 '19 at 16:20
  • @CD001 I get the content via a cURL request from a Windows based API. I convert it to a SimpleXML object using simplexml_load_string() in PHP. I've looked through my code and I can't see that I am specifying a particular charset at any point during the process. Perhaps this is the problem? – Robbie Lewis Jan 22 '19 at 16:24
  • Theoretically then, the issue could even be the API you're retrieving the data from or the `CURLOPT_ENCODING` ... I think you're going to need to dump the string out and `exit()` the script at each stage to find out exactly *where* it's going wonky. – CD001 Jan 22 '19 at 16:28
  • @CD001 Thanks for your continued help on this. There's a final stage I didn't mention. Finally, I'm using the content to create or update a product in Woocommerce using the Woocommerce REST API. The encoding seems to be correct right up until I POST the information to that API. – Robbie Lewis Jan 22 '19 at 16:36
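
One way to follow the "dump the string at each stage" suggestion is to classify the raw bytes captured at each checkpoint instead of eyeballing them. The sketch below is in Python for illustration only. The function name `classify` and the sample byte strings are hypothetical, and the check is a heuristic: it assumes the only damage of interest is UTF-8 that was decoded as Windows-1252 and then re-encoded as UTF-8.

```python
def classify(raw: bytes) -> str:
    """Guess whether raw bytes are clean UTF-8 or UTF-8 that was encoded twice."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if raw.isascii():
        return "plain ASCII"
    try:
        # If the text survives a Windows-1252 round trip back to valid UTF-8,
        # it was most likely misread as Windows-1252 somewhere upstream.
        text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "valid UTF-8"
    return "double-encoded UTF-8"


# Hypothetical samples standing in for dumps taken at different stages.
print(classify("LEGO® Ideas".encode("utf-8")))   # bytes as the API sends them
print(classify("LEGO® Ideas".encode("utf-8").decode("cp1252").encode("utf-8")))
print(classify("LEGO® Ideas".encode("latin-1")))
```

Running the same check on the cURL response body, on the string taken from SimpleXML, and on the payload sent to the WooCommerce REST API would show the first stage at which the result changes from "valid UTF-8" to something else.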
