
I'm reading in a file and replacing some text, then writing a new file, line by line. I use the following code to read and write the file. Usually there are no issues with files that are CP1252 or UTF-8 encoded, but when I read in a file encoded as "UCS-2 LE BOM", the saved file starts with BOM characters and contains a whole lot of whitespace. I know this is due to the encoding, but I don't know whether I need to read it in differently or save it differently. Also, I know I could set the encoding when I read the file in, but how can I handle differently-encoded files without knowing which one is coming? I have no control over the file until it hits my Java code. Any help is appreciated, thank you.

        FileInputStream sourceFileInputStream = new FileInputStream(sourceFile);
        DataInputStream sourceDataInputStream = new DataInputStream(sourceFileInputStream);

        // No charset is given, so the platform default encoding is used to decode.
        BufferedReader sourceBufferedReader = new BufferedReader(
                new InputStreamReader(sourceDataInputStream));
        // FileWriter likewise always encodes with the platform default.
        FileWriter targetFileWriter = new FileWriter(new File(targetFileLocation));
        BufferedWriter targetBufferedWriter = new BufferedWriter(
                targetFileWriter);
                  .
                  .
                  .
        targetBufferedWriter.write(newTextline);
  • Try with InputStreamReader and OutputStreamWriter, which let you set the encoding explicitly; see the sketch after these comments. – Omore Apr 14 '17 at 17:16
  • Can you use the `file` command to determine the correct file type? – Erich Kitzmueller Apr 14 '17 at 17:17
  • Generally you have to have metadata that records the character encoding for a file. You can't always inspect it and determine the correct encoding. However, you can peek at the first few bytes and determine if there's a BOM and its endianness. Distinguishing between UTF-8 and Cp1252 isn't necessary if the content is all in the ASCII range, but otherwise guessing would require reading the whole file and making a probabilistic guess about which is right. – erickson Apr 14 '17 at 17:22
  • Maybe this topic with answers can help you: http://stackoverflow.com/questions/3759356/what-is-the-most-accurate-encoding-detector – Вардан Матевосян Apr 14 '17 at 20:10
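
Following Omore's suggestion, here is a minimal sketch that sets the encodings explicitly (the class name, file arguments, and replacement strings are placeholders). Java's "UTF-16" charset reads the BOM and picks the byte order from it, and a file labelled "UCS-2 LE BOM" decodes as UTF-16LE for the usual (BMP) characters:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    public class Recode {
        public static void main(String[] args) throws IOException {
            String sourceFile = args[0];          // placeholder: the incoming file
            String targetFileLocation = args[1];  // placeholder: the output file

            // UTF_16 consumes the BOM, detects the byte order, and does not
            // pass the U+FEFF marker through to the decoded text.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                         new FileInputStream(sourceFile), StandardCharsets.UTF_16));
                 BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                         new FileOutputStream(targetFileLocation), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line.replace("oldText", "newText")); // hypothetical edit
                    writer.newLine();
                }
            }
        }
    }

Writing the output as UTF-8 here is just a choice for the sketch; pass whatever target charset you need to the OutputStreamWriter.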

1 Answer

  1. The BOM can indicate several encodings, not just UTF-8. See the Wikipedia article Byte order mark, and the sniffer sketch after this list.

  2. In the absence of a BOM, you don't need to read the whole file; you can read just as much as needed until you have meaningful statistics. Often 100 or so bytes are enough - I once wrote a program that did that. On the other hand, there is a certain chance that even if you read the entire file the statistics will not be conclusive. The method I used was based on letter frequency - unigram, bigram and trigram frequencies by language, and the relationship of encoding to language. When calculating bigram and trigram frequencies I suggest that whitespace should be considered in its own right. This accounts for the frequency of letters at the beginning and at the end of words. So for "now is the" the bigrams will be _n, no, ow, w_, _i, is, s_, _t, th, he, e_ (writing _ for a space); see the second sketch below. See Monogram, Bigram and Trigram frequency counts.
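
For point 1, a minimal sketch of such a BOM sniffer (the class name and the use of PushbackInputStream are choices of this sketch; the byte signatures are the ones the Wikipedia article lists). The UTF-32 checks must come before the UTF-16 ones, because a UTF-32LE BOM also begins with FF FE:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.PushbackInputStream;
    import java.nio.charset.Charset;

    public class BomSniffer {
        // Peeks at up to four bytes; returns the charset the BOM indicates, or
        // null if there is no BOM. Non-BOM bytes are pushed back so the caller
        // can start reading from the right position.
        public static Charset detect(PushbackInputStream in) throws IOException {
            byte[] b = new byte[4];
            int n = in.read(b, 0, 4);
            if (n <= 0) return null;
            if (n >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE && b[2] == 0 && b[3] == 0) {
                return Charset.forName("UTF-32LE");
            }
            if (n >= 4 && b[0] == 0 && b[1] == 0 && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF) {
                return Charset.forName("UTF-32BE");
            }
            if (n >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) {
                in.unread(b, 3, n - 3);
                return Charset.forName("UTF-8");
            }
            if (n >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) {
                in.unread(b, 2, n - 2);
                return Charset.forName("UTF-16LE"); // what editors label "UCS-2 LE BOM"
            }
            if (n >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) {
                in.unread(b, 2, n - 2);
                return Charset.forName("UTF-16BE");
            }
            in.unread(b, 0, n); // no BOM: push everything back
            return null;
        }

        public static void main(String[] args) throws IOException {
            // The pushback buffer must hold at least the four bytes we peek at.
            try (PushbackInputStream in =
                         new PushbackInputStream(new FileInputStream(args[0]), 4)) {
                Charset cs = detect(in);
                System.out.println(cs != null ? "BOM indicates " + cs : "no BOM found");
            }
        }
    }

When detect returns null, you fall back to a default or to the statistical guess from point 2.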
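
And for point 2, a small sketch of the bigram extraction described above (padding the text with one space at each end is one way to give the first and last letters of each word their boundary bigrams):

    import java.util.ArrayList;
    import java.util.List;

    public class Bigrams {
        // Treats whitespace as a character in its own right by padding the text
        // with a single space at each end, then sliding a two-character window.
        public static List<String> bigrams(String text) {
            String padded = " " + text + " ";
            List<String> result = new ArrayList<>();
            for (int i = 0; i < padded.length() - 1; i++) {
                result.add(padded.substring(i, i + 2).replace(' ', '_'));
            }
            return result;
        }

        public static void main(String[] args) {
            // Prints: [_n, no, ow, w_, _i, is, s_, _t, th, he, e_]
            System.out.println(bigrams("now is the"));
        }
    }

Counting these per language (and doing the same for trigrams) gives the frequency tables that the probabilistic guess is matched against.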

Jonathan Rosenne