How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

Question

My application downloads XML files that are encoded in either UTF-8 or ISO-8859-1 (the software that generates those files is crappy, so it produces both). I'm from Germany, so we use umlauts (ä, ü, ö), and it really makes a difference how those files are encoded. I know that XmlPullParser has a method .getInputEncoding() which correctly detects how my files are encoded. However, I already have to set the encoding when I wrap my FileInputStream in a reader, which happens before I get to call .getInputEncoding(). So far I'm just using a BufferedReader to read the first line of the XML file, search for the entry that specifies the encoding, and then instantiate my PullParser afterwards.

private void setFileEncoding() {
    try {
        bufferedReader.reset();
        String firstLine = bufferedReader.readLine();
        int start = firstLine.indexOf("encoding=") + 10; // +10 skips encoding= (9 chars) plus the opening quote

        String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));

        // now set the encoding to the reader to be used for parsing afterwards
        bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
        bufferedReader.mark(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Is there a different way to do this? Can I take advantage of the .getInputEncoding() method? Right now the method seems kind of useless to me: what good is it to check the encoding if I already had to set it before I could ask for it?

score 0 · Accepted Answer · edited May 23 '17 at 12:00

If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you're doing. However, be aware that it can be wrong; it can disagree with the actual encoding.
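
A minimal sketch of sniffing the declaration from the raw bytes instead of from an already-decoded reader, so that no encoding has to be chosen up front. The class name `XmlDeclarationSniffer`, the 256-byte limit, and the regular expression are illustrative assumptions, not part of any library. It assumes an ASCII-compatible encoding (which covers UTF-8 and ISO-8859-1) and does not handle UTF-16 or strip a UTF-8 byte order mark.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclarationSniffer {
    // Assumption: the XML declaration fits in the first 256 bytes.
    private static final int SNIFF_LIMIT = 256;

    // Matches encoding="..." or encoding='...' inside the <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the encoding named in the declaration, or "UTF-8" if none is declared. */
    public static String declaredEncoding(byte[] head, int length) {
        // ISO-8859-1 maps every byte to one char, so this decoding step cannot fail.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(1) : "UTF-8";
    }

    /** Peeks at the start of the stream, rewinds it, and returns a correctly decoding Reader. */
    public static Reader openReader(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream
            n += r;
        }
        in.reset(); // rewind so the parser sees the document from its first byte
        return new InputStreamReader(in, declaredEncoding(head, n));
    }
}
```

The returned Reader could then be handed to `XmlPullParser.setInput(Reader)`. XmlPullParser also offers `setInput(InputStream, String)`, and its documentation describes a null encoding as a request for the parser to detect the encoding itself, which may make manual sniffing unnecessary.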

If you want to detect the encoding directly, independently of the (potentially wrong) XML declaration encoding setting, use a library such as ICU CharsetDetector or the older jChardet.

ICU CharsetDetector:

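import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

byte[] byteData = ...;

CharsetDetector detector = new CharsetDetector();
detector.setText(byteData);

// Best guess for the input; detect() returns null when nothing matches.
CharsetMatch match = detector.detect();
String encoding = (match != null) ? match.getName() : null;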

jChardet:

    // Initialize the nsDetector() ;
    int lang = (argv.length == 2)? Integer.parseInt(argv[1])
                                     : nsPSMDetector.ALL ;
    nsDetector det = new nsDetector(lang) ;

    // Set an observer...
    // The Notify() will be called when a matching charset is found.

    det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                HtmlCharsetDetector.found = true ;
                System.out.println("CHARSET = " + charset);
            }
    });

    URL url = new URL(argv[0]);
    BufferedInputStream imp = new BufferedInputStream(url.openStream());

    byte[] buf = new byte[1024] ;
    int len;
    boolean done = false ;
    boolean isAscii = true ;

    while( (len=imp.read(buf,0,buf.length)) != -1) {

            // Check if the stream is only ascii.
            if (isAscii)
                isAscii = det.isAscii(buf,len);

            // DoIt if non-ascii and not done yet.
            if (!isAscii && !done)
                done = det.DoIt(buf,len, false);
    }
    det.DataEnd();

    if (isAscii) {
       System.out.println("CHARSET = ASCII");
       found = true ;
    }
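
When the only candidates are UTF-8 and ISO-8859-1, as in the question, a general-purpose detector may be more than is needed. The following is a sketch of a different, simpler technique than the two libraries above: try a strict UTF-8 decode and fall back to ISO-8859-1 if it fails. The class name `Utf8OrLatin1` is hypothetical. It relies on the fact that ISO-8859-1 text containing umlauts is almost never valid UTF-8, though a false UTF-8 result is possible in principle.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8OrLatin1 {
    /** Returns UTF-8 if the bytes decode cleanly as UTF-8, otherwise ISO-8859-1. */
    public static Charset guess(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; assume the only other candidate.
            return StandardCharsets.ISO_8859_1;
        }
    }
}
```

Pure ASCII input is reported as UTF-8, which is harmless because both encodings agree on ASCII.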

answered Sep 19 '16 at 03:14

kjhughes

Thanks! However, this doesn't seem any simpler than my current working solution. – Crosswind Sep 19 '16 at 05:45
@Crosswind: I've updated the answer to acknowledge the viability *and limitations* of sniffing the XML declaration. – kjhughes Sep 19 '16 at 12:26
Thanks for your help! Well I do trust the owner right now (but I'll keep the rest of your answer for later!). What do you mean by 'sniff it as you're doing'? Would my solution be suitable then? Or any other suggestions? – Crosswind Sep 19 '16 at 12:29
Sniffing means reading into the parse stream to detect something without doing a full parse. As I've said, your solution is suitable only if you trust the creator of the XML, [which isn't always a good move](http://stackoverflow.com/questions/29915467/there-is-no-unicode-byte-order-mark-cannot-switch-to-unicode/29918434). Another possibility would be to trust the HTTP header, if available, for the character encoding setting. – kjhughes Sep 19 '16 at 12:43
Thanks! This definitely answered my question. – Crosswind Sep 21 '16 at 10:29
I don't know if it's just me, but ICU CharsetDetector freezes on Linux. It doesn't even throw an exception; it just stops there. – Pedro Joaquín Jan 06 '21 at 16:39
@PedroJoaquín: Ask a new question that includes a [mcve] illustrating your problem. Thanks. – kjhughes Jan 06 '21 at 18:25

score 0 · Answer 2 · answered Sep 19 '16 at 04:40

You may be able to get the correct character-set from the content-type header, if your server sends it correctly.
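
As a rough illustration, the charset parameter can be pulled out of a Content-Type header value like this. The class name `ContentTypeCharset` and the regular expression are assumptions made for this sketch; it handles the common `type/subtype; charset=name` form rather than every case the HTTP grammar allows.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {
    // Case-insensitive match for "; charset=name" with optional double quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    /** Returns the charset named in the header value, or empty if absent or unknown. */
    public static Optional<Charset> fromHeader(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }
}
```

The header value itself could come from `URLConnection.getContentType()`. An empty result means the header gives no usable hint, and some other method has to decide.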

lionscribe
