
Ok, I hope somebody can help because I haven't been able to find a solution for this.

In the database, customers managed to import or otherwise add character data from a different character set such as:

  <E2><80><99>

I believe this is UTF-16.

My script pulls strings, such as a "description" field, from the database and builds an XML file. The XML output is throwing errors due to this data:

XML Parsing Error: not well-formed Line Number 20, Column 50.

There's some other hex that's longer, like <80><99> (just an example, I'm not sure if this is an actual character).

How can I make my XML file valid, and either downscale the character set or get it to use UTF-32 like so:

  AddType "application/xml; charset=UTF-32" xml  (in .htaccess file along with filesmatch .xml)


  <?xml version='1.0' encoding='UTF-32' ?>   (placed in head of xml file)
Paul Cravey
  • `0xe2 0x80 0x99` is UTF-8 for 'RIGHT SINGLE QUOTATION MARK' (U+2019). I think your problem lies elsewhere. Could you show us the first few lines of xml? – Anders Lindahl May 02 '12 at 10:20
  • Where are you seeing these errors? Do you have an example URL? What is Line Number 20? What is at Column 50? – hakre May 02 '12 at 10:27
  • Here's an example: XML Parsing Error: not well-formed Location: http://x.x.x.x/xml/hal-default.xml Line Number 20, Column 50: The Hangmans Creek Ranch is a 190 (the special char is right here after "190") acre ranch Looking at this through a hex editor: 0001140 3931 b130 6120 7263 2065 6172 636e 2068 1 9 0 1 sp a c r e sp r a n c h sp Does that clarify anything? – Paul Cravey May 02 '12 at 17:12
  • Which program gives you that error? Is it PHP? If so, what is the related PHP code? – hakre May 02 '12 at 19:05
  • This error happens when the .xml file is viewed in the browser (Firefox 10 in my case, but other browsers as well). – Paul Cravey May 02 '12 at 20:58

1 Answer


Whatever it is (UTF-8, UTF-16 or UTF-32): if the encoding you choose for your output differs from the encoding of your input, you must re-encode the input before you write it out.
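To illustrate the re-encoding step, here is a minimal sketch in Python. The language of the script in the question is not known, and the helper name `reencode` and the sample bytes are made up for illustration. It assumes you already know the real input encoding (Windows-1252 in this example, which is only an assumption).

```python
def reencode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode with the known input encoding, then encode for the output."""
    text = raw.decode(source_encoding)    # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> output bytes


# Hypothetical input: U+2019 stored as the single Windows-1252 byte 0x92.
converted = reencode(b"Hangman\x92s", "cp1252")
print(converted)  # b'Hangman\xe2\x80\x99s'
```

The output encoding must then match what the XML declaration and the HTTP header announce.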

You clearly state in your question that you don't know exactly what the input encoding is. You need to get that straight, because encoding is meta-information: you have to know it to process strings properly. From what you shared, it looks like the input is UTF-8 encoded. You should verify that (How to detect malformed utf-8 string in PHP?).
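A quick way to run that check, sketched in Python rather than PHP (the function name `is_valid_utf8` is hypothetical). It relies on Python's strict UTF-8 decoder, so this is one possible check, not the only one.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The three bytes from the question form one valid UTF-8 sequence.
sample = bytes([0xE2, 0x80, 0x99])
print(is_valid_utf8(sample))                     # True
print(unicodedata.name(sample.decode("utf-8")))  # RIGHT SINGLE QUOTATION MARK

# The shorter example from the question lacks its lead byte and is not valid.
print(is_valid_utf8(bytes([0x80, 0x99])))        # False
```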

Next, "not well-formed" does not necessarily mean an encoding problem (though it can). Unless you share the source of the problem (ideally as text together with a hex dump), there is not much more advice that can be given based on the current information, I'd say.
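For what it's worth, the hex dump posted in the comments can be inspected with a short sketch like the one below. It assumes the dump is in the 16-bit word format that `od` prints by default on a little-endian machine, so each word's bytes are swapped. The helper name `od_words_to_bytes` is made up, and reading the odd byte as Latin-1 is only a guess at the original encoding.

```python
import struct


def od_words_to_bytes(words):
    """Turn 16-bit hex words (little-endian host assumed) back into bytes."""
    return b"".join(struct.pack("<H", int(word, 16)) for word in words)


dump = "3931 b130 6120 7263 2065 6172 636e 2068".split()
raw = od_words_to_bytes(dump)
print(raw)  # b'190\xb1 acre ranch '

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # A lone 0xB1 byte cannot start a UTF-8 sequence.
    print("invalid UTF-8 at byte offset", exc.start)  # 3

# Assumption: the data is really Latin-1 (or Windows-1252), where 0xB1 is "±".
print(raw.decode("latin-1"))
```

If that reading is right, the byte after "190" is a single 0xB1 rather than a UTF-8 sequence, which would explain why an XML parser expecting UTF-8 rejects the file at that column.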

hakre