7

I need to check the encoding of files. This code works, but it is rather long. How can I refactor this logic? Is there perhaps another approach for this task?

Code:

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.charset.*;

class CharsetDetector implements Checker {

    Charset detectCharset(File currentFile, String[] charsets) {
        Charset charset = null;

        for (String charsetName : charsets) {
            charset = detectCharset(currentFile, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }

        return charset;
    }

    private Charset detectCharset(File currentFile, Charset charset) {
        try {
            BufferedInputStream input = new BufferedInputStream(
                    new FileInputStream(currentFile));

            CharsetDecoder decoder = charset.newDecoder();
            decoder.reset();

            byte[] buffer = new byte[512];
            boolean identified = false;
            while ((input.read(buffer) != -1) && (!identified)) {
                identified = identify(buffer, decoder);
            }

            input.close();

            if (identified) {
                return charset;
            } else {
                return null;
            }

        } catch (Exception e) {
            return null;
        }
    }

    private boolean identify(byte[] bytes, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }

    @Override
    public boolean check(File fileToCheck) {
        return charsetDetector(fileToCheck);
    }

    private boolean charsetDetector(File currentFile) {
        String[] charsetsToBeTested = { "UTF-8", "windows-1253", "ISO-8859-7" };

        CharsetDetector charsetDetector = new CharsetDetector();
        Charset charset = charsetDetector.detectCharset(currentFile,
                charsetsToBeTested);

        if (charset != null) {
            try {
                InputStreamReader reader = new InputStreamReader(
                        new FileInputStream(currentFile), charset);

                // Reading a character confirms the file is decodable with this charset.
                reader.read();
                reader.close();
            } catch (FileNotFoundException exc) {
                System.out.println("File not found!");
                exc.printStackTrace();
            } catch (IOException exc) {
                exc.printStackTrace();
            }
        } else {
            System.out.println("Unrecognized charset.");
            return false;
        }

        return true;
    }
}

Questions:

  • How can this program logic be refactored?
  • What are other ways to detect an encoding (e.g. a UTF-16 sequence)?
catch23

2 Answers

5

The best way to refactor this code would be to bring in a third-party library that does character detection for you, because such libraries probably do it better and it would make your code smaller. See this question for a few alternatives.
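
For example, here is a minimal sketch using juniversalchardet, a Java port of Mozilla's universal charset detector; the buffer size and the method name are my choices, not anything the question's code requires:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class DetectorSketch {

    // Feeds the file through the detector and returns the guessed charset name,
    // or null if the detector could not decide.
    public static String detectCharsetName(File file) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buffer = new byte[4096];
        FileInputStream fis = new FileInputStream(file);
        try {
            int nread;
            while ((nread = fis.read(buffer)) > 0 && !detector.isDone()) {
                detector.handleData(buffer, 0, nread);
            }
        } finally {
            fis.close();
        }
        detector.dataEnd();                   // signal end of input
        return detector.getDetectedCharset(); // e.g. "UTF-8", or null
    }
}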

radai
  • **"3rd party library"** - What is this library? And what does generally do? – catch23 Mar 01 '13 at 09:50
  • @nazar_art - By "library" I mean a *.jar file that contains code you can use. Plenty of people have written character-detection code in Java before and made it open source - use their code. – radai Mar 01 '13 at 09:52
3

As has been pointed out, you can't "know" or "detect" the encoding of a file. Complete accuracy requires that you be told, as there is almost always a byte sequence which is ambiguous with respect to several character encodings.

You'll find some more discussion about detecting UTF-8 vs ISO8859-1 in this SO question. The essential answer is to check each byte sequence in the file to verify its compatibility with the expected encoding. For the UTF-8 byte encoding rules, see http://en.wikipedia.org/wiki/UTF-8.

In particular, there's a very interesting paper on detecting character encodings/sets: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html They claim extremely high accuracy (for what are still guesses!). The price is a very complex detection system, complete with knowledge about character frequencies in different languages, that doesn't fit in the 30 lines the OP has hinted at as being the right code size. Apparently the detection algorithm is built into Mozilla, so you can likely find and extract it.

We settled for a much simpler scheme: a) believe what you are told the character set is, if you are told; b) if not, check for a BOM and believe what it says if one is present; otherwise sniff for pure 7-bit ASCII, then UTF-8, then ISO8859, in that order. You can build an ugly routine that does this in one pass over the file, as sketched just below.
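
A rough one-pass sketch of that scheme (the names and the ISO-8859-1 fallback are my choices; `looksLikeUtf8` stands in for a UTF-8 validity scan such as the `UTF8size` routine below):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

static Charset guessCharset(byte[] bytes, Charset declared) {
    if (declared != null) return declared; // a) believe what we are told
    // b) check for a BOM and believe it
    if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
            && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF)
        return StandardCharsets.UTF_8;     // UTF-8 BOM: EF BB BF
    if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF)
        return StandardCharsets.UTF_16BE;  // UTF-16 big-endian BOM: FE FF
    if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE)
        return StandardCharsets.UTF_16LE;  // UTF-16 little-endian BOM: FF FE
    // c) sniff: pure 7-bit ASCII first, then UTF-8, then an 8-bit fallback
    boolean pureAscii = true;
    for (byte b : bytes) {
        if ((b & 0x80) != 0) { pureAscii = false; break; }
    }
    if (pureAscii) return StandardCharsets.US_ASCII;
    if (looksLikeUtf8(bytes)) return StandardCharsets.UTF_8;
    return StandardCharsets.ISO_8859_1;
}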

(I think the problem is going to get worse over time. Unicode has a new revision every year, with truly subtle differences in valid code points. To do that right, you need to check every code point for validity. If we're lucky, they're all backwards compatible.)

[EDIT: OP seems to be having trouble coding this in Java. Our solution and the sketch on the other page are not coded in Java, so I can't copy and paste an answer directly. I'm going to draft a Java version here based on his code; it isn't compiled or tested. YMMV]

int UTF8size(byte[] buffer, int buf_index)
// Java version of the character-sniffing test on the other page.
// This only checks for a UTF-8 compatible bit-pattern layout;
// a tighter test (what we actually did) would check for valid UTF-8 code points.
{   int first_character = buffer[buf_index];

    // This first-character test might be faster as a switch statement
    if ((first_character & 0x80) == 0) return 1; // ASCII subset character, fast path
    else if ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (buf_index + 3 >= buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80)
         && ((buffer[buf_index + 3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (buf_index + 2 >= buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence
        if (buf_index + 1 >= buffer.length) return 0;
        if ((buffer[buf_index + 1] & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

public static boolean isUTF8 ( File file ) {
    if (null == file) {
        throw new IllegalArgumentException ("input file can't be null");
    }
    if (file.isDirectory ()) {
        throw new IllegalArgumentException ("input file refers to a directory");
    }

    // read the whole input file into memory
    int file_size = (int) file.length();
    byte [] buffer = new byte[file_size];
    try {
        FileInputStream fis = new FileInputStream ( file ) ;
        int offset = 0;
        while (offset < file_size) { // read() may return fewer bytes than requested
            int n = fis.read ( buffer, offset, file_size - offset ) ;
            if (n < 0) break;
            offset += n;
        }
        fis.close ();
    }
    catch ( IOException e ) {
        throw new IllegalArgumentException ("Can't read input file, error = " + e.getLocalizedMessage () );
    }

    int buf_index = 0;
    while (buf_index < file_size) {
        int step = UTF8size(buffer, buf_index);
        if (step == 0) return false; // definitely not a UTF-8 file
        buf_index += step;
    }

    return true ; // appears to be a UTF-8 file
}
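
A hypothetical call site for the sketch above (the file name is made up):

File candidate = new File("input.txt");
if (isUTF8(candidate)) {
    System.out.println("appears to be UTF-8");
} else {
    System.out.println("contains a byte sequence that is not valid UTF-8");
}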
Ira Baxter
  • How can I check for different encoding types? That is, prove that currentFile has some particular encoding. Which approach is better in this situation? – catch23 Mar 04 '13 at 21:16
  • The SO question I referenced tells you how to detect the different coding types using a relatively simple check. You can use a much more complicated check ("UniversalCharsetDetection") with higher accuracy, but unless you want to spend your life replicating that work, I'd stick with the simple scheme. – Ira Baxter Mar 04 '13 at 21:47
  • Which relatively simple check do you mean? I don't understand. Can you show this scheme in more detail? – catch23 Mar 05 '13 at 06:28
  • What about other encodings ("windows-1253", "ISO-8859-7")? How can we check for those? – catch23 Mar 05 '13 at 07:55
  • You pretty much can't differentiate 8-bit character sets; they are just interpretations of the 8-bit codes. To the extent that a character set has unique codes that others don't have, you can do it. But ISO8859-1 AFAIK defines all 256 codes... so every 8-bit file is arguably ISO8859-1, and also windows-1253, and ... The "UniversalCharset" guys would say that if you know the frequencies of characters that occur in normal documents, you can still make an educated guess. So you get to pick their very complex, pretty solution, or live with something much simpler, which is what we chose. – Ira Baxter Mar 05 '13 at 08:02
  • `unc ::IsUTF8(unc *cpt)` - What does `unc` mean? And `::IsUTF8`? And `*cpt`? Is it some method? Could you describe it? – catch23 Mar 05 '13 at 10:41
  • I think "unc" means "integer", and "::IsUTF8" means "define the function IsUTF8 as a globally visible function". All you care about is the core code, which shows how to classify sequences of bytes as valid UTF-8. – Ira Baxter Mar 05 '13 at 14:17
  • Here, in `((*cpt & 0xF8) == 0xF0)` => why do we need the `*cpt & 0xF8`? And what does `*cpt` mean? Would you mind helping me implement this code in Java? – catch23 Mar 06 '13 at 15:57
  • It's not my code, but I'd guess "*cpt" means "the character at which cpt points". – Ira Baxter Mar 06 '13 at 17:25
  • What about translating this method from C to Java? Do you have any suggestions? – catch23 Mar 07 '13 at 10:14
  • I converted the `isUTF8` method to Java [like this](http://pastebin.com/c27wfZYK) ==> as a result, this part doesn't work: `if ((buffer[0] & 0xF8) == 0xF0) {` (and `currentFile` definitely has a valid encoding). Why does this happen? What is wrong? How do I solve this problem? – catch23 Mar 07 '13 at 21:56
  • You essentially have to read the whole file, checking each byte to see if it is the beginning of a valid UTF-8 byte sequence, skipping that sequence if yes and repeating, until you hit EOF (must be UTF-8) or some byte sequence isn't valid UTF-8 (not a UTF-8 file). The example code on the other page shows a way to approximate fairly reasonably whether the character sequence is UTF-8. A more precise version requires that you encode the Unicode character set standard carefully. You choose. – Ira Baxter Mar 09 '13 at 23:43
  • Can you give a piece of advice on how to check for UTF-16? (Generally I don't know how to do this.) The situation now looks like my answer below. The important point is that [this boolean isUTF8(File file)](http://pastebin.com/RueiK6Mb) doesn't work. How do I solve this issue? – catch23 Mar 10 '13 at 08:52
  • A key point is you have to check the *whole file* to see if it is UTF-xxx compatible. I've coded (perhaps clumsily) a version for UTF-8. The way to check for UTF-16 is to inspect pairs of bytes from the file, to see that all such pairs interpreted as a 16-bit code are a valid UTF-16 code point. You need special handling for the UTF-16 surrogate codes used to handle characters in the 0x010000 to 0x10FFFF code point range (see the sketch after this thread). – Ira Baxter Mar 10 '13 at 10:39
  • Good answer! Can you share how to check a file the UTF-16 way? Mainly the part from `if ((first_character & 0x80) == 0)` onward, because that is the most complicated part at my level. And lastly, do we really need to read the whole file? That is the most resource-costly part. – catch23 Mar 10 '13 at 19:21
  • And if you don't read the whole file, how do you know the byte sequence you didn't read doesn't contain something which is not UTF-8? You can read a few thousand bytes instead of the whole file and make a guess which is probably right most of the time, but all that does is create a headache under rare circumstances. Regarding UTF-16: I've basically written the UTF-8 code for you; time to spread your wings and figure out how to do the UTF-16 on your own. – Ira Baxter Mar 10 '13 at 23:38
  • How would one implement this method for a UTF-16 sequence? What does `& 0xF8` on `buffer[buf_index + 1]` generally mean (the first byte is clear), and why do we then compare with `== 0x80`? How do you find the right number for the `& ...`? And why make a check like `if (buf_index+3>=buffer.length) return 0;`? – catch23 Mar 11 '13 at 14:03
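
A rough sketch of the UTF-16 check described in this thread (my code, not from the answer; this is the big-endian variant, so swap the two bytes of each unit for UTF-16LE):

static boolean isUTF16BE(byte[] b) {
    if (b.length % 2 != 0) return false; // UTF-16 text is a sequence of 16-bit units
    for (int i = 0; i < b.length; i += 2) {
        int unit = ((b[i] & 0xFF) << 8) | (b[i + 1] & 0xFF);
        if (unit >= 0xD800 && unit <= 0xDBFF) {       // high surrogate: a low surrogate must follow
            if (i + 3 >= b.length) return false;
            int next = ((b[i + 2] & 0xFF) << 8) | (b[i + 3] & 0xFF);
            if (next < 0xDC00 || next > 0xDFFF) return false;
            i += 2;                                   // consume the low surrogate as well
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            return false;                             // a stray low surrogate is invalid
        }
    }
    return true; // every unit or surrogate pair was a valid UTF-16 code point
}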