137

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.

Here's my environment:

  • Windows 2003, OS encoding: CP1252

  • Java 5.0

My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.

I use the following code to do my work:

    private static String readFileAsString(String filePath)
            throws java.io.IOException {
        StringBuffer fileData = new StringBuffer(1000);
        FileReader fileReader = new FileReader(filePath);
        //System.out.println(fileReader.getEncoding());
        BufferedReader reader = new BufferedReader(fileReader);
        char[] buf = new char[1024];
        int numRead = 0;
        while ((numRead = reader.read(buf)) != -1) {
            String readData = String.valueOf(buf, 0, numRead);
            fileData.append(readData);
            buf = new char[1024];
        }
        reader.close();
        return fileData.toString();
    }

The above code doesn't work: I found that the FileReader's encoding is CP1252 even when the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.

Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I am currently getting wrongly encoded data, so what's the correct way to deal with my situation? Thanks.

nybon
  • You should also lose the String.valueOf() inside the loop and use StringBuffer.append(char[],int,int) directly. This saves a lot of copying of the char[]. Also replace StringBuffer with StringBuilder. None of this is about your question, though. – Joachim Sauer Mar 30 '09 at 12:01
  • 1
    I hate to say it, but have you read the JavaDoc right after the part you pasted? You know, the part that says "To specify these values yourself, construct an InputStreamReader on a FileInputStream."? – Powerlord Mar 30 '09 at 13:55
  • Thanks for your comment, actually I read the JavaDoc, but what I am not sure is whether or not I should specify these values myself, and switch to "construct an InputStreamReader on a FileInputStream". – nybon Mar 31 '09 at 01:05
  • Yes, if you know the file is in something other than the platform default encoding, you have to tell the InputStreamReader which one to use. – Alan Moore Mar 31 '09 at 04:46

6 Answers

270

Yes, you need to specify the encoding of the file you want to read.

Yes, this means that you have to know the encoding of the file you want to read.

No, there is no general way to guess the encoding of any given "plain text" file.

The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.

Since Java 11, FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).

In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), &lt;encoding&gt;).
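A minimal pre-Java-11 sketch of that recipe, shaped like the asker's method but with the encoding named explicitly (the temp file and its UTF-8 content are just for demonstration):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    // Same shape as the asker's method, but the encoding is named
    // explicitly instead of inherited from the platform default.
    static String readFileAsString(String filePath) throws IOException {
        StringBuilder fileData = new StringBuilder(1000);
        try (Reader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(filePath), StandardCharsets.UTF_8))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                fileData.append(buf, 0, numRead);
            }
        }
        return fileData.toString();
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a UTF-8 file containing non-Latin characters.
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "你好, world".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFileAsString(tmp.toString()));  // prints: 你好, world
        Files.delete(tmp);
    }
}
```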

Joachim Sauer
  • 1
InputStream is = new FileInputStream(filename); — here I got a "file not found" error with a Russian file name – Bhanu Sharma Feb 10 '14 at 08:59
  • 3
    +1 for the suggestion of using InputStreamReader, however using links in code blocks makes it hard to copy and paste the code, if this can be changed, thx – Ferrybig Sep 26 '15 at 16:12
  • 1
    Would it be "UTF-8" or "UTF8" in the encodings. According to [the Java SE reference on encoding](https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html), since `InputStreamReader` is a `java.io` class, it would be "UTF8"? – NobleUplift Nov 13 '15 at 19:01
  • 11
    @NobleUplift: the safest bet is `StandardCharsets.UTF_8`, there's no chance of mistyping there ;-) But yes, if you go with string `"UTF8"` would be correct (although I seem to remember that it will accept both ways). – Joachim Sauer Nov 13 '15 at 19:53
  • @NobleUplift Actually I think Java accepts most permutations of 'UTF-8', with and without dash and upper- and lowercase letters. – Stijn de Witt Nov 20 '15 at 15:58
  • 1
    @JoachimSauer Actually, this is one of the purposes of the `Byte Order Mark`, along with.. well.. establishing the byte order! :) As such I find it weird that Java's FileReader is not able to automatically detect UTF-16 that has such a BOM... In fact I once wrote a `UnicodeFileReader` that does exactly that. Unfortunately closed source, but Google has it's [UnicodeReader](https://developers.google.com/gdata/javadoc/com/google/gdata/util/io/base/UnicodeReader) which is very similar. – Stijn de Witt Nov 20 '15 at 16:02
  • @StijndeWitt: as far as I know the byte order marker is only meant to indicate which UTF-16 variant is used (LE or BE), not to distinguish between UTF-16 and other encodings. It has been used for UTF-8 as well, but that has never formally been standardized. So all in all I wouldn't count that as a reliable way to know the encoding of random files (if you know that all your files are some UTF-16 variant, then go ahead and use it, but otherwise I wouldn't want to depend on it). – Joachim Sauer Nov 20 '15 at 16:34
  • @JoachimSauer The BOM is explicitly meant to serve the dual purpose of establishing byte order AND of acting as a signature marker: *"Q: When a BOM is used, is it only in 16-bit Unicode text? A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in."* http://unicode.org/faq/utf_bom.html – Stijn de Witt Nov 21 '15 at 15:17
  • 1
    @StijndeWitt: I stand corrected. The main problem still exists: this only helps if the data in question is UTF-* **and** has a BOM, which is not required by the spec (and it is often not present). – Joachim Sauer Nov 21 '15 at 16:45
  • @JoachimSauer Yes, very true. Probaly we should make a habit of starting text files with a BOM... problem is some older software trips over it. However, *if* it's there we could and should use it. Too bad `FileReader` can't cope with it. – Stijn de Witt Nov 21 '15 at 18:06
80

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.

If this "best guess" is not correct then you have to specify the encoding explicitly. Unfortunately, FileReader does not allow this (major oversight in the API). Instead, you have to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.
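To see why the default matters, here is a small demonstration (not from the answer itself) of the kind of mojibake the asker saw: UTF-8 bytes decoded with the Windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8Bytes = "你好".getBytes(StandardCharsets.UTF_8);

        // Decoding UTF-8 bytes with a single-byte charset scrambles them.
        String wrong = new String(utf8Bytes, Charset.forName("windows-1252"));
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong);  // mojibake, e.g. "ä½ å¥½"
        System.out.println(right);  // 你好
    }
}
```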

Michael Borgwardt
  • 24
    "major oversight in the API" - thanks for this explanation - I was wondering why I couldn't find the constructor I was after ! Cheers John – monojohnny Apr 12 '13 at 13:42
  • @Bhanu Sharma: that's an encoding issue at a different level, check where you're getting the filename from, and if it's hardcoded what encoding the compiler uses. – Michael Borgwardt Feb 10 '14 at 09:27
  • I use both a file name and a hardcoded string, but I get the same problem. What should I do? :( – Bhanu Sharma Feb 10 '14 at 09:34
  • /storage/emulated/0/bhanuдосвидания.txt: open failed: ENOENT (No such file or directory) – Bhanu Sharma Feb 10 '14 at 09:35
  • Please help, sir, I am stuck on this very badly. :( – Bhanu Sharma Feb 10 '14 at 09:42
  • 1
    @BhanuSharma: filename encoding issues are nothing to do with this question. See one of the many existing “why don't Unicode filenames work in Java” questions. Spoiler: java.io APIs like FileReader use C standard library filesystem calls, which can't support Unicode on Windows; consider using java.nio instead. – bobince May 12 '15 at 07:48
  • 1
    "`FileReader` uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale." I wouldn't say that. At least of Windows. For some weird technical/historical reasons, the JVM ignores the fact that Unicode is the [recommended](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374083(v=vs.85).aspx) encoding on Windows for 'all new applications' and instead *always* acts as if the legacy encoding configured *as fallback for legacy apps* is the 'platform default'. – Stijn de Witt Nov 20 '15 at 16:06
  • 7
    I would even go as far as saying that if your Java app does not *explicitly* specify encodings every time it's reading or writing to files/streams/resources, it's *broken*, because it *can not ever* work reliably then. – Stijn de Witt Nov 20 '15 at 16:08
14

For Java 7+ you can use this:

BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);

See the Charset documentation for the full list of charsets.

For example, if your file is in CP1252, use this charset:

Charset.forName("windows-1252");

The supported encodings documentation lists the other canonical names for Java encodings, both for java.io and java.nio.

If you do not know exactly which encoding a file uses, you may use a third-party library such as the charset-detection tool from Google, which works fairly well.
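Putting those pieces together, a complete NIO round-trip might look like this (the temp file and sample text are assumptions for the demo):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class NioReadDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");

        // Write with the Windows code page, then read it back
        // with the same explicitly named charset.
        Charset cp1252 = Charset.forName("windows-1252");
        Files.write(tmp, "café".getBytes(cp1252));

        try (BufferedReader reader = Files.newBufferedReader(tmp, cp1252)) {
            System.out.println(reader.readLine());  // prints: café
        }
        Files.delete(tmp);
    }
}
```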

Andreas Gelever
8

Since Java 11 you may use this:

public FileReader(String fileName, Charset charset) throws IOException;
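A short usage sketch of that constructor (the temp file and its contents are assumptions for the demo):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Java11FileReaderDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.writeString(tmp, "grüße 你好", StandardCharsets.UTF_8);

        // Java 11+: FileReader finally accepts a Charset directly.
        try (BufferedReader reader = new BufferedReader(
                new FileReader(tmp.toFile(), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());  // prints: grüße 你好
        }
        Files.delete(tmp);
    }
}
```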
Radoslav Ivanov
1

FileInputStream with InputStreamReader is better than directly using FileReader, because the latter doesn't allow you to specify the encoding charset.

Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you can read lines from a file.

List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();
public void readAll( ) throws IOException{
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(
            new FileInputStream(fileName), charset)); 

    String line; 
    while ((line = reader.readLine()) != null) { 
        line = line.trim();
        if( line.length() == 0 ) continue;
        int idx = line.indexOf("\t");
        words.add( line.substring(0, idx ));
        meanings.add( line.substring(idx+1));
    } 
    reader.close();
}
Guangtong Shen
0

For non-Latin languages, for example Cyrillic, you can use something like this:

FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);

and make sure that your .txt file is saved in UTF-8 (not the default ANSI) format. Cheers!

marc_s
Iefimenko Ievgen