
I wanted to convert a file's encoding from one to another (it doesn't matter which). But when I open the resulting file (w.txt), its contents are messed up: Windows does not interpret it correctly.

What target encoding should I pass (args[1]) so that the output is interpreted correctly by Windows Notepad?

import java.io.*;
import java.nio.charset.Charset;

public class Kodowanie {

    public static void main(String[] args) throws IOException {
        // args hardcoded for testing: source file name and target encoding
        args = new String[2];
        args[0] = "plik.txt";
        args[1] = "ISO8859_2";
        String linia;
        File f = new File(args[0]), f1 = new File("w.txt");
        // read the source file as UTF-8
        FileInputStream fis = new FileInputStream(f);
        InputStreamReader isr = new InputStreamReader(fis,
                Charset.forName("UTF-8"));
        BufferedReader in = new BufferedReader(isr);

        // write the copy in the target encoding (args[1])
        FileOutputStream fos = new FileOutputStream(f1);
        OutputStreamWriter osw = new OutputStreamWriter(fos,
                Charset.forName(args[1]));
        BufferedWriter out = new BufferedWriter(osw);
        while ((linia = in.readLine()) != null) {
            out.write(linia);
            out.newLine();
        }
        out.close();
        in.close();

    }

}

input:

Ala
ma 
Kota

output:

?Ala
ma 
Kota

Why is there a '?'?

user1769735
  • How do you know the file is messed up? Does your file viewer support the file encoding? – Edwin Dalorzo Oct 29 '12 at 01:27
  • Probably, the ? indicates the presence of a Byte Order Mark (BOM) at the start of any file saved/created with a Unicode encoding. – ee. Oct 29 '12 at 02:30

2 Answers


The default encoding on Windows (in Western-European locales) is Cp1252.
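
A minimal sketch of that suggestion, reusing the file names from the question ("Cp1252" is Java's name for windows-1252; the class name ToCp1252 is just for illustration, and this assumes a Western-European Windows locale):

import java.io.*;
import java.nio.charset.Charset;

public class ToCp1252 {

    public static void main(String[] args) throws IOException {
        // read the UTF-8 source and write it back as Cp1252, the
        // legacy ANSI code page that Notepad assumes by default
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("plik.txt"), Charset.forName("UTF-8")));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("w.txt"), Charset.forName("Cp1252")))) {
            String linia;
            while ((linia = in.readLine()) != null) {
                out.write(linia);
                out.newLine();
            }
        }
    }
}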

thedayofcondor

US-ASCII is a subset of Unicode (a pretty small one, by the way). You are reading the file in UTF-8 and then writing it back in US-ASCII. Thus the encoder has to make a decision whenever a given Unicode character cannot be expressed in the reduced 7-bit US-ASCII subset. Classically, such a character is replaced by a default character, like ?.

Take into account that characters in UTF-8 are often multibyte, whereas US-ASCII characters are only 7 bits long. This means that all Unicode characters above code point 127 cannot be expressed in US-ASCII. That would explain the question marks you see once the file has been converted.
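
A small self-contained demo of that replacement behaviour (the U+FEFF below stands in for the BOM mentioned in the comments; String.getBytes substitutes the charset's default replacement byte for anything it cannot map, which for US-ASCII is '?'):

import java.nio.charset.Charset;

public class ReplacementDemo {

    public static void main(String[] args) {
        // U+FEFF (the BOM) survives Java's UTF-8 decoding, so it becomes
        // the first character of the first line read from the file
        String linia = "\uFEFFAla";
        // getBytes() replaces every character US-ASCII cannot represent
        // with its default replacement byte, '?'
        byte[] ascii = linia.getBytes(Charset.forName("US-ASCII"));
        System.out.println(new String(ascii, Charset.forName("US-ASCII")));
        // prints: ?Ala  -- the same leading '?' as in the question's output
    }
}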

I answered a similar question, Reading Strange Unicode Characters in Java. Perhaps it helps.

I also recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Edwin Dalorzo
    @user1769735 The problem is not in your code, but in the data, or in your idea of how the data should be manipulated. Who created the file you are reading? Yourself or somebody else? What was the encoding used when the file was created? – Edwin Dalorzo Oct 29 '12 at 02:18
  • By me. I used Save As and chose UTF-8. – user1769735 Oct 29 '12 at 07:55
  • @user1769735 In that case your file contains an initial character known as the [BOM - Byte Order Mark](http://en.wikipedia.org/wiki/Byte_order_mark). My recommendation would be to save your file in the same encoding you intend to use in your program, or as UTF-8 without a BOM; but if you insist on this, you can simply skip the first character when reading your file (which only works if your file really was encoded with a BOM). – Edwin Dalorzo Oct 29 '12 at 13:29
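
A minimal sketch of that last suggestion, assuming a UTF-8 input (the class name SkipBom is just for illustration): it peeks at the first character after opening the reader and drops it only when it really is the BOM.

import java.io.*;
import java.nio.charset.Charset;

public class SkipBom {

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("plik.txt"), Charset.forName("UTF-8")))) {
            // peek at the first character and drop it only if it is the
            // BOM (U+FEFF); otherwise rewind so no data is lost
            in.mark(1);
            int first = in.read();
            if (first != 0xFEFF) {
                in.reset();
            }
            String linia;
            while ((linia = in.readLine()) != null) {
                System.out.println(linia);
            }
        }
    }
}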