
I have Japanese content that is being converted to MS Help with a third-party tool. The problem is that the tool isn't using UTF-8 encoding and is creating an .xml file with garbage characters:

    <param name="Name" value="&#195;&#137;A&#195;&#137;v&#195;&#137;&#195;&#164;&#195;&#137;P&#195;&#133;&#195;&#137;V&#195;&#137;&#195;&#161;&#195;&#137;&#195;&#172;&#195;&#135;&#8224;&#195;&#135;'&#195;&#135;&#195;&#139;&#195;&#135;&#195;&#152;&#195;&#133;&#501;&#195;&#135;&#195;&#039;&#195;&#135;&#195;&#039;]">
    <param name="Name" value="Test File">
    <param name="Local" value="applications.htm#Xau1044547">

I tried playing around with the encoding and it now produces:

    <param name="Name" value="ÉAÉvÉäÉPÅ">
    <param name="Name" value="Test">
    <param name="Local" value="applications.htm#Xau1044547">

With UTF-8 encoding (using another tool), the correct output is:

    <param name="Name" value="アプリケーション">
    <param name="Name" value="Small Business アプリケーションの起動 ">
    <param name="Local" value="applications1.html#wp1044548">

Is there a Java API I can use to decode and re-encode the files to get the correct output? I am not sure which encoding the tool is using, but I am guessing it's "ISO-8859-1".

Thanks.

Sumaiya

2 Answers

1

Your problem is that you need to use two encodings correctly:

  • Find out what encoding your "Japanese content" uses
  • Make sure the tool correctly uses that encoding to read that content
  • Make sure the tool uses UTF-8 to encode the output file and correctly declares that in its header.
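
If the tool itself cannot be configured, the same read-with-one-encoding, write-with-UTF-8 step can be sketched in Java. This is a minimal sketch, not the tool's behaviour: the class name `Transcoder` and its methods are hypothetical, and the source charset is an assumption you have to supply yourself (for Japanese content it is often something like `Shift_JIS` or `EUC-JP`, but that needs to be confirmed for your files).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: decodes with a known source charset, re-encodes as UTF-8.
final class Transcoder {

    /** Decodes the bytes using sourceCharset and returns the same text encoded as UTF-8. */
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Reads a whole file in sourceCharset and writes it back out as UTF-8. */
    static void transcodeFile(Path in, Path out, Charset sourceCharset) throws IOException {
        Files.write(out, toUtf8(Files.readAllBytes(in), sourceCharset));
    }

    public static void main(String[] args) throws IOException {
        // Usage (charset name is an assumption): java Transcoder in.htm out.htm Shift_JIS
        transcodeFile(Paths.get(args[0]), Paths.get(args[1]), Charset.forName(args[2]));
    }
}
```

This only works on the original, uncorrupted content, and it does not touch any encoding declaration inside the file, which would still have to say UTF-8.
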
Michael Borgwardt
  • I was hoping to do some post-processing on the file to get the right characters. That is why I have been trying some Java APIs to encode and decode the file, so far without any success. – Sumaiya Apr 12 '11 at 13:41
  • @Sumaiya: post processing is the wrong method to address encoding problems, because it's often fundamentally impossible to fix data that has been corrupted by the wrong usage of encodings. – Michael Borgwardt Apr 12 '11 at 14:34
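
To illustrate the point about post-processing, here is a small Java sketch that checks whether bytes survive being read with the wrong charset and written back. The class `RoundTripCheck` and its method are hypothetical names used only for this illustration. Some wrong charsets (such as ISO-8859-1, which maps every byte to a character) happen to be reversible, while others replace the bytes they cannot decode, and at that point the original data is gone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for checking whether a wrong decoding step is reversible.
final class RoundTripCheck {

    /** Returns true if decoding and re-encoding with wrongCharset restores the original bytes. */
    static boolean survivesRoundTrip(byte[] original, Charset wrongCharset) {
        // Bytes the charset cannot decode are silently replaced here.
        String misread = new String(original, wrongCharset);
        return Arrays.equals(original, misread.getBytes(wrongCharset));
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u30A2\u30D7\u30EA".getBytes(StandardCharsets.UTF_8);
        System.out.println(survivesRoundTrip(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(survivesRoundTrip(utf8, StandardCharsets.US_ASCII));
    }
}
```
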
0

It would appear from the uppermost sample that your encoding at that point is already corrupt. The value for the first "Name" attribute is being represented with HTML character escape codes (decimal NCRs).

That being said, the 2nd sample (value="ÉAÉvÉäÉPÅ") and the 3rd sample (value="アプリケーション") do not match the 1st.

If HTML character escapes are indeed what the output should be, then the output encoding would be ASCII or some other variant, and the value would then be:

value="&#12450;&#12503;&#12522;&#12464;&#12540;&#12471;&#12519;&#12531;"

I think you would need to reconfirm how this 3rd party tool is outputting the XML.

buruzaemon