I am exporting accented characters from a MySQL database to XML, but I am getting really wonky results.

For the basics - the MySQL table is set up with latin-1 encoding. Not ideal. However, all input is converted to HTML entities before it is stored, which seems to be working great; I can read data back all day long, and it looks correct on the screen.

Here is a sample item.

On the screen, it looks like this:

me hace reír

Note the accented "i" character (with acute accent).

In the database, it is stored like this:

me hace re&iacute;r

The "i" with the acute is properly replaced with the HTML entity, which allows for proper display on screen. If I wrap that inside of a textarea, it still reads correctly - no acute HTML entity, just the correct accented "i" character.

My XML file has a proper UTF-8 header on it:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

But when I read the data from the DB and export it to the XML...

$xml.="<dedicatedBecause>".($dedicatedbecause)."</dedicatedBecause>"."\n";

With "$dedicatedbecause" holding a totally unprocessed piece of data from the DB, I get the following in my XML file:

me hace reÃ-r

In other words, a DIFFERENT accent character plus a dash. In other cases, I get other nonsense characters (copyright symbol, various other accents, etc, etc).

I have a huge function for massaging data to UTF-8, but it doesn't seem to matter. If I turn it off, I get the same result.

What gives? What am I missing here?

Thanks for your help!

osuddeth
  • That looks like proper UTF-8 to me. How are you reading the file to check it? – Sami Kuhmonen Jan 05 '20 at 22:13
  • With `export it to the XML` are you using a parser? Can you please add more of the related PHP code? – user3783243 Jan 05 '20 at 22:52
  • I am opening the XML file in Vim, also have tried in notepad and notepad plus. All show the same. In addition, when I pass the file along to another script that reads it to create PDFs, it pulls in these same wonky characters. – osuddeth Jan 06 '20 at 03:09
  • At this point, I've pulled out all other parse logic aside from html_entity_decode. Echoing before and after gives me the "&iacute;" versus the accented "i" character. But that accented i - which should be legal UTF-8 - breaks in the XML. All I'm doing is a `file_put_contents()` of the XML string I've concatenated along the way. – osuddeth Jan 06 '20 at 03:57
  • Okay, I have a little more info. If I use `html_entity_decode($string, ENT_COMPAT, "ISO8859-1");` to cast it to 8859-1, it works great. The XML looks perfect. Except that when I try to load in the XML to the other half of the application (PDF creation via fPDF/fPDI), it chokes on the non UTF-8 input. If I use `html_entity_decode($string, ENT_COMPAT, "UTF-8");`, it makes the nonsensical XML. That'll generate a PDF all right... with nonsensical accent characters. Double you tee eff. – osuddeth Jan 06 '20 at 05:10
  • Never store HTML entities in your database. Here you are seeing why. You are imposing restrictions from the display part of your application onto other parts. The fact that the function is called `*html*_entity_decode` should tell you it's only used for HTML, not XML or SQL. (Never mind the fact that all modern browsers can display multibyte characters, and entities are not needed at all.) – miken32 Jan 22 '20 at 21:54

1 Answer

&iacute; is a named (X)HTML entity. Named entities like this are not known/valid in basic, well-formed XML, which predefines only &amp;, &lt;, &gt;, &quot; and &apos;. Decoding it to UTF-8 is the right way. But it looks like at some point you treat the UTF-8 string with the decoded entity as Latin-1. The Ã is a typical symptom.
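A minimal sketch of that conversion, assuming the stored values contain only HTML entities plus plain ASCII and that the mbstring extension is available. The helper name `entityTextToXml()` is hypothetical and not part of the question's code; it decodes the entities to UTF-8 and then re-escapes only the characters XML reserves:

```php
<?php
// Hypothetical helper (not from the original post): turn entity-encoded
// text from the database into a string that is safe inside a UTF-8 XML file.
function entityTextToXml(string $stored): string
{
    // Decode named entities such as &iacute; into real UTF-8 characters.
    $utf8 = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Re-escape only what XML itself reserves (&, <, >, and quotes).
    return htmlspecialchars($utf8, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$dedicatedbecause = 'me hace re&iacute;r'; // value as stored in the DB

$xml  = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . "\n";
$xml .= '<dedicatedBecause>' . entityTextToXml($dedicatedbecause) . '</dedicatedBecause>' . "\n";

// Sanity check before writing the file: the whole string must be valid UTF-8.
var_dump(mb_check_encoding($xml, 'UTF-8'));
```

The important part is that nothing after this step converts the string again; any later Latin-1 to UTF-8 conversion would reintroduce the broken characters.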

Here is a demo provoking the behavior:

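// Value as stored in the database: ASCII text with a named HTML entity.
$data = 'me hace re&iacute;r';

// Correct step: decode the entity into a UTF-8 "í" (bytes C3 AD).
$decoded = html_entity_decode($data, ENT_COMPAT, "UTF-8");
// Faulty step: utf8_encode() assumes Latin-1 input, so each of the two
// UTF-8 bytes is encoded again as a separate character.
$treatedAsLatin1 = utf8_encode($decoded);

var_dump(
    $decoded, $treatedAsLatin1
);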

Output:

string(13) "me hace reír"
string(15) "me hace reÃ­r"

utf8_encode() is an old PHP function (deprecated as of PHP 8.2) that converts a Latin-1 string to UTF-8. However, the same misinterpretation can happen in the browser as well (depending on your HTTP headers).
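To find out whether the bytes in the file are actually wrong or whether the viewer is merely misreading them, it can help to look at the raw bytes instead of the rendered text. The following is a sketch that assumes the mbstring extension is available; it uses `mb_convert_encoding()` in place of `utf8_encode()` to simulate the double encoding, and the variable names are illustrative only:

```php
<?php
// Correctly decoded UTF-8: "í" is the two bytes C3 AD.
$correct = html_entity_decode('re&iacute;r', ENT_COMPAT, 'UTF-8');

// Simulate the bug: the UTF-8 bytes are misread as ISO-8859-1 and
// converted to UTF-8 a second time.
$doubled = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo bin2hex($correct), "\n"; // 7265c3ad72
echo bin2hex($doubled), "\n"; // 7265c383c2ad72

// Reversing the extra step restores the original bytes. This assumes the
// value was double-encoded exactly once; it is not a general repair.
$repaired = mb_convert_encoding($doubled, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $correct);
```

If the file contains `c3 ad` for the accented character, the XML itself is fine and the program reading it is interpreting it as Latin-1. If it contains `c3 83 c2 ad`, the string was converted twice somewhere before it was written.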

ThW