I want to detect the encoding of a stream.

1st method: use InputStreamReader.

But it always returns the OS default encoding.

    InputStreamReader reader = new InputStreamReader(new FileInputStream("aa.rar"));
    System.out.println(reader.getEncoding());

Output: GBK

2nd method: use UniversalDetector.

But it always returns null.

    UniversalDetector detector = new UniversalDetector(null);
    byte[] buf = new byte[4096];

    // Feed the detector until it is confident or the stream ends
    try (FileInputStream input = new FileInputStream("aa.rar")) {
        int nread;
        while ((nread = input.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
    }

    // Tell the detector that there is no more data
    detector.dataEnd();

    // Returns null if no encoding could be determined
    String encoding = detector.getDetectedCharset();

    if (encoding != null) {
        System.out.println("Detected encoding = " + encoding);
    } else {
        System.out.println("No encoding detected.");
    }

    // Reset the detector before reusing it
    detector.reset();

Output: null

How can I get the correct encoding? :(

Alex K
youzhi.zhang
  • InputStreamReader will always use platform encoding. It does not attempt to detect encoding in files. What type of files are you running through UniversalDetector? In your example you used a RAR file, which is a compressed binary format. Try with a simple ASCII text file first. – prunge Nov 29 '11 at 04:37
  • Hi, I changed the file to 'Fortunes.txt'. Output: No encoding detected – youzhi.zhang Nov 29 '11 at 05:03
  • It doesn't seem to detect 'standard' UTF-8 or UTF-16 without a BOM, but it worked for UTF-16 with a BOM for me. Maybe consider using a different library for charset detection? [This link](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) might help. – prunge Nov 29 '11 at 06:36
  • Detecting encodings by inspecting text data is unreliable guesswork. You really need to have the encoding specified as metadata somewhere to be sure. – Michael Borgwardt Nov 29 '11 at 09:33
  • @Michael Borwardt: but in many cases you do *not* have any metadata specifying the encoding and you do *not* have any specs telling you in which encoding the txt files you need to parse will be encoded. In these cases the "guesswork" done by things like: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html (using letters frequency in addition to a lot of other heuristics) seems to be quite "scientific" a guesswork. All is not always black and white. When you do not have metadata, you do not say: *"I need metadata"* but you work hard and you write (or reuse) a detector. – TacticalCoder Nov 29 '11 at 12:59

2 Answers

Let's summarize the situation:

  • InputStream delivers bytes
  • Reader classes deliver chars, decoded from those bytes using some encoding
  • new InputStreamReader(inputStream) uses the platform default encoding (historically taken from the operating system)
  • new InputStreamReader(inputStream, "UTF-8") uses the given encoding (here UTF-8)

So one needs to know the encoding before reading. You were right to run a charset-detecting class first.
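To make the difference between bytes and chars concrete, here is a minimal sketch using only the JDK. The class name `ExplicitCharsetDemo` and the helper `readAll` are made up for this illustration; the point is only that the same bytes decode to different text depending on the charset passed to InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class ExplicitCharsetDemo {

    // Decodes the whole stream with the given charset and returns the text.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);

        // Decoded with the right charset: one char. With the wrong one: two chars.
        System.out.println("UTF-8 decode length:      " + right.length());
        System.out.println("ISO-8859-1 decode length: " + wrong.length());
    }
}
```

With the wrong charset nothing fails; the text is just silently different, which is why the encoding has to be known (or guessed) before the Reader is created.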

According to http://code.google.com/p/juniversalchardet/, it should handle UTF-8 and UTF-16. You could open the file in the editor jEdit to verify its encoding and see whether there is some problem with the file itself.
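One thing that can be checked by hand is whether the file starts with a byte order mark (BOM). The following is only a rough sketch with the JDK alone; `BomSniffer` and `charsetFromBom` are names invented here, and it deliberately covers just the UTF-8 and UTF-16 BOMs. A file without a BOM yields null, which says nothing about its actual encoding.

```java
// Hypothetical helper, not part of juniversalchardet.
class BomSniffer {

    // Returns the charset name implied by a leading BOM, or null if there is none.
    // UTF-32 BOMs are not handled: FF FE 00 00 would be reported as UTF-16LE.
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "A" in UTF-16LE, preceded by the little-endian BOM
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, (byte) 0x41, (byte) 0x00};
        System.out.println(charsetFromBom(sample));
    }
}
```

In practice the first few bytes of the file would be read into `head`; if the result is null, a statistical detector is still needed.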

Joop Eggen
  • We can use other tools to achieve this, but I can't understand how they actually do it. It seems hard to deal with. :( – youzhi.zhang Nov 30 '11 at 01:23
  • Juniversalchardet doesn't support ISO-8859-1, which is a very common charset – Thomas Jun 15 '21 at 11:01
  • @Thomas universalchardet originates from the browser area, where ISO-8859-1 is reinterpreted as Windows-1252 (officially since HTML 5), so maybe Windows-1252 aka Cp1252 works. YES, checked – Joop Eggen Jun 15 '21 at 12:41
0
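    // Requires: import java.io.IOException; import java.io.InputStream;
    //           import org.mozilla.universalchardet.UniversalDetector;
    public String getDecoder(InputStream inputStream) {
        String encoding = null;

        // try-with-resources closes the stream even if reading fails
        try (InputStream in = inputStream) {
            byte[] buf = new byte[4096];
            UniversalDetector detector = new UniversalDetector(null);
            int nread;

            // Feed the detector until it is confident or the stream ends
            while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }

            detector.dataEnd();
            encoding = detector.getDetectedCharset(); // null if nothing was detected
            detector.reset();
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }

        return encoding;
    }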
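Since the detected name can be null (as in the question), the caller needs a fallback before building a Reader. A possible sketch, JDK only; `CharsetFallback` and `resolve` are hypothetical names, and which fallback charset is appropriate is an assumption that depends on where the files come from.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for turning a detector result into a usable Charset.
class CharsetFallback {

    // Returns the charset for detectedName, or fallback if the name is null,
    // syntactically illegal, or not supported by this JVM.
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Simulates a detector that found nothing
        Charset cs = resolve(null, StandardCharsets.UTF_8);
        System.out.println("Using charset: " + cs.name());
    }
}
```

The resolved Charset can then be passed to new InputStreamReader(inputStream, charset) on a freshly opened stream, because the detection pass has already consumed the first one.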
Mohammed Saqib Rajput