I have an XML file that uses unusual entities whenever a filename contains Chinese characters. I have no idea how to decode these filenames. Any ideas?

<string name="Name" value="&Aacute;&yacute;&frac34;&micro; &ordm;&pound;&Iacute;&otilde;&Ocirc;&curren;&cedil;&aelig;&AElig;&not;-01.wav"/>

The resulting name should be 慢镜 海王预告片-01.wav

How would I turn these back into the correct name?

mzjn
John Baker
  • Looks like you have an interesting case of [mojibake](https://en.wikipedia.org/wiki/Mojibake) there! Most likely UTF-8 misread as some 8-bit encoding, then escaped with HTML entities, but you'll have to experiment a bit to get the right combination. What tool or language do you have to do the decoding with? – IMSoP Dec 16 '18 at 18:38
  • I've explored around this a bit, and failed to find the connection between the HTML entity references and the supposed decoding. Something has clearly gone badly wrong, and I'd suggest tracing it back to the root cause. I suspect there are multiple layers of re-encoding of incorrectly encoded strings here. – Michael Kay Dec 16 '18 at 19:15
  • This is an XML export from Nuendo (audio editing app). Weirdly, it doesn't have the usual first line stating the encoding. So I think they have somehow kludged the system platform encoding into bytes in the attributes. I believe the system encoding was GB18030. However, Nuendo can definitely reimport this file and the names are still correct. – John Baker Dec 16 '18 at 22:57
  • What programming language are you using? Add the language to the tag. – jdweng Dec 17 '18 at 10:19
  • Java. Looks like it's GB18030 encoded as Latin-1. What I really need is an entity resolver for woodstox that understands all these w3c entities. There's hundreds! https://www.w3.org/TR/xml-entity-names/bycodes.html – John Baker Dec 17 '18 at 13:33
  • @JohnBaker Can you [edit] your question to include the whole XML content? It looks weird to have HTML entities in an XML document. – Progman Dec 17 '18 at 15:02
  • @Progman The XML was too large to put in the question. It is only names and filenames that seem to use this strange HTML encoding format. The XML also has no encoding information at the top. Here is a temporary link (one week) to the whole file. https://wetransfer.com/downloads/b3e770fe0e36d146fe08f6ddc8b695eb20181217192606/ca986f – John Baker Dec 17 '18 at 19:27
  • @JohnBaker The problem is that the generated XML file from the Nuendo app is invalid. You have to file a bug report with the developers of the Nuendo app and/or check https://stackoverflow.com/questions/44765194/how-to-parse-invalid-bad-not-well-formed-xml on how to deal with invalid XML. – Progman Dec 17 '18 at 19:44
  • @Progman That would be ideal, except in the real world you have to work around whatever rubbish some 'pro' editing app throws out. Premiere Pro, for instance, produces lots of strange non-compliant XML (I have about 10 workarounds for converting that). No company cares, because with interchange you are moving their data to a different app. Avid doesn't even follow their own interchange standards. This is normal in crazy media world. – John Baker Dec 17 '18 at 23:39

1 Answer

It looks like text encoded in the GB18030 encoding has been interpreted as Latin-1 and then the characters have been escaped as HTML entity references.
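
To make that hypothesis concrete, here is a minimal sketch of the suspected corruption and its reversal, using only the JDK. The class and method names (`MojibakeRoundTrip`, `garble`, `repair`) are made up for illustration, and the entity-escaping step is left out. The sample string is 海王 from the wanted filename, written as `\u` escapes so the snippet does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRoundTrip {
    static final Charset GB18030 = Charset.forName("GB18030");

    // Simulates the suspected corruption: GB18030 bytes read as Latin-1 characters.
    static String garble(String original) {
        return new String(original.getBytes(GB18030), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: Latin-1 characters back to raw bytes, then decoded as GB18030.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GB18030);
    }

    public static void main(String[] args) {
        String original = "\u6D77\u738B"; // two characters from the wanted filename
        String garbled = garble(original);

        // Show the garbled text as code points; each one is a single original byte.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();

        // Latin-1 maps every byte to exactly one character, so the repair is lossless.
        System.out.println(repair(garbled).equals(original));
    }
}
```

Because Latin-1 assigns a character to every byte value, no information is lost in the garbling step itself. That is why the file can still be reimported correctly by the application that wrote it.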

The unescapeHtml4() method of the StringEscapeUtils class from Apache Commons Text can be used to unescape the entity references, as the small program in this answer (class Chinese) demonstrates.

The program prints 笼镜 海王预告片-01.wav to standard output. That is very close to what you asked for: only the first Chinese character differs. If &Aacute; in the input string is changed to &Acirc;, the program outputs exactly the wanted filename (慢镜 海王预告片-01.wav).
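
One way to examine that one-character difference is to print the GB18030 bytes of the two candidate characters side by side. The sketch below uses only the JDK. The class name `FirstCharBytes` and the helper `hex` are made up for illustration, and the two characters are given as code points (U+7B3C for 笼, U+6162 for 慢) so the snippet does not depend on the source file encoding.

```java
import java.nio.charset.Charset;

class FirstCharBytes {
    // Returns the GB18030 encoding of s as uppercase hex digits.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName("GB18030"))) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u7B3C")); // character the program actually prints
        System.out.println(hex("\u6162")); // character the question expects
    }
}
```

If the observation about &Aacute; and &Acirc; holds, the two hex strings should share their second byte and differ only in the first. That would mean a single entity in the sample differs from what the wanted name requires.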

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.commons.text.StringEscapeUtils;

public class Chinese {
    public static void main(String[] args) {
        String fname = "&Aacute;&yacute;&frac34;&micro; &ordm;&pound;&Iacute;&otilde;&Ocirc;&curren;&cedil;&aelig;&AElig;&not;-01.wav";
        decode(fname);
    }

    static void decode(String s) {
        Charset gb18030 = Charset.forName("GB18030");

        // Step 1: turn entity references such as &Aacute; back into single characters.
        String unescaped = StringEscapeUtils.unescapeHtml4(s);
        // Step 2: all characters are now in the Latin-1 range, so this recovers the raw bytes.
        byte[] latin1Bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
        // Step 3: interpret those bytes as GB18030 text.
        String text = new String(latin1Bytes, gb18030);

        // Print as UTF-8 regardless of the platform default (this constructor needs Java 10+).
        PrintStream ps = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        ps.println(text);
    }
}
```
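
If pulling in Apache Commons Text is not an option, the same steps can be sketched with the JDK alone and a small lookup table. Everything here is illustrative: the class name `EntityDecoder`, its methods, and the table itself, which covers only the entities in the sample string plus &Acirc; and would need extending from the W3C entity list for real use. It is independent of any particular XML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {
    // Partial table (assumption: only the Latin-1 entities seen in the sample).
    static final Map<String, Character> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("Aacute", (char) 0xC1);
        ENTITIES.put("Acirc", (char) 0xC2);
        ENTITIES.put("yacute", (char) 0xFD);
        ENTITIES.put("frac34", (char) 0xBE);
        ENTITIES.put("micro", (char) 0xB5);
        ENTITIES.put("ordm", (char) 0xBA);
        ENTITIES.put("pound", (char) 0xA3);
        ENTITIES.put("Iacute", (char) 0xCD);
        ENTITIES.put("otilde", (char) 0xF5);
        ENTITIES.put("Ocirc", (char) 0xD4);
        ENTITIES.put("curren", (char) 0xA4);
        ENTITIES.put("cedil", (char) 0xB8);
        ENTITIES.put("aelig", (char) 0xE6);
        ENTITIES.put("AElig", (char) 0xC6);
        ENTITIES.put("not", (char) 0xAC);
    }

    static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named entities; unknown ones are left untouched.
    static String unescape(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            Character c = ENTITIES.get(m.group(1));
            out.append(c != null ? String.valueOf(c) : m.group());
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    // Unescape, recover the raw bytes via Latin-1, then decode them as GB18030.
    static String decode(String escaped) {
        byte[] raw = unescape(escaped).getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        System.out.println(decode("&frac34;&micro; &ordm;&pound;&Iacute;&otilde;-01.wav"));
    }
}
```

Note that any character above U+00FF cannot be encoded as Latin-1 and would turn into `?` in the byte-recovery step. Entities outside the Latin-1 range would therefore need separate handling.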
mzjn
  • Thanks. I'll try updating my entity resolver to use this method and see what happens. Although if the first character is different then it won't work for all filenames. Similar is not enough. – John Baker Jan 01 '19 at 17:32
  • It is strange that it works save for one character. I don't know Chinese and there is only one sample of input and expected output, so it's hard to tell what the underlying problem is. – mzjn Jan 02 '19 at 06:23
  • Surely this answer isn't completely off base. There must be a reason why my suggestion produces something that is close to the wanted result. It would be interesting to know which filename makes the most sense to someone who understands Chinese. – mzjn Jan 04 '19 at 08:26