
I wanted to convert a file's encoding from one to another (it doesn't matter which). But when I open the resulting file (w.txt), its contents are messed up: Windows does not interpret it correctly.

What target encoding should I pass (args[1]) so that the output is interpreted correctly by Windows Notepad?

import java.io.*;
import java.nio.charset.Charset;

public class Kodowanie {

    public static void main(String[] args) throws IOException {
        // args hardcoded for testing: source file name and target encoding
        args = new String[2];
        args[0] = "plik.txt";
        args[1] = "ISO8859_2";
        String linia;
        File f = new File(args[0]), f1 = new File("w.txt");
        // read the source file as UTF-8
        FileInputStream fis = new FileInputStream(f);
        InputStreamReader isr = new InputStreamReader(fis,
                Charset.forName("UTF-8"));
        BufferedReader in = new BufferedReader(isr);

        // write the copy in the target encoding (args[1])
        FileOutputStream fos = new FileOutputStream(f1);
        OutputStreamWriter osw = new OutputStreamWriter(fos,
                Charset.forName(args[1]));
        BufferedWriter out = new BufferedWriter(osw);
        while ((linia = in.readLine()) != null) {
            out.write(linia);
            out.newLine();
        }
        out.close();
        in.close();

    }

}

input:

Ala
ma 
Kota

output:

?Ala
ma 
Kota

Why is there a '?'?

user1769735
  • How do you know the file is messed up? Does your file viewer support the file encoding? – Edwin Dalorzo Oct 29 '12 at 01:27
  • Probably, the ? indicates the presence of a Byte Order Mark (BOM) at the start of any file saved/created with a Unicode encoding. – ee. Oct 29 '12 at 02:30

2 Answers


The default encoding on Windows (in Western-European locales) is Cp1252.
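
A minimal sketch of that suggestion, reusing the file names from the question ("Cp1252" is Java's name for windows-1252; the class name ToCp1252 is just for illustration, and this assumes a Western-European Windows locale):

import java.io.*;
import java.nio.charset.Charset;

public class ToCp1252 {

    public static void main(String[] args) throws IOException {
        // read the UTF-8 source and write it back as Cp1252, the
        // legacy ANSI code page that Notepad assumes by default
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("plik.txt"), Charset.forName("UTF-8")));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("w.txt"), Charset.forName("Cp1252")))) {
            String linia;
            while ((linia = in.readLine()) != null) {
                out.write(linia);
                out.newLine();
            }
        }
    }
}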

thedayofcondor

US-ASCII is a subset of Unicode (a pretty small one, by the way). You are reading the file in UTF-8 and then writing it back in US-ASCII. Thus the encoder has to make a decision whenever a given Unicode character cannot be expressed in the reduced 7-bit US-ASCII subset. Classically, such a character is replaced by a default character, like ?.

Take into account that characters in UTF-8 are often multibyte, whereas US-ASCII characters are only 7 bits long. This means that all Unicode characters above code point 127 cannot be expressed in US-ASCII. That would explain the question marks you see once the file has been converted.
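
A small self-contained demo of that replacement behaviour (the U+FEFF below stands in for the BOM mentioned in the comments; String.getBytes substitutes the charset's default replacement byte for anything it cannot map, which for US-ASCII is '?'):

import java.nio.charset.Charset;

public class ReplacementDemo {

    public static void main(String[] args) {
        // U+FEFF (the BOM) survives Java's UTF-8 decoding, so it becomes
        // the first character of the first line read from the file
        String linia = "\uFEFFAla";
        // getBytes() replaces every character US-ASCII cannot represent
        // with its default replacement byte, '?'
        byte[] ascii = linia.getBytes(Charset.forName("US-ASCII"));
        System.out.println(new String(ascii, Charset.forName("US-ASCII")));
        // prints: ?Ala  -- the same leading '?' as in the question's output
    }
}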

I answered a similar question, Reading Strange Unicode Characters in Java. Perhaps it helps.

I also recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Edwin Dalorzo
    @user1769735 The problem is not in your code, but in the data, or in your idea of how the data should be manipulated. Who created the file you are reading? Yourself or somebody else? What was the encoding used when the file was created? – Edwin Dalorzo Oct 29 '12 at 02:18
  • By me. I used Save As and chose UTF-8. – user1769735 Oct 29 '12 at 07:55
  • @user1769735 In that case your file contains an initial character known as the [BOM - Byte Order Mark](http://en.wikipedia.org/wiki/Byte_order_mark). My recommendation would be to save your file in the same encoding you intend to use in your program, or as UTF-8 without a BOM; but if you insist on this, you can simply skip the first character when reading your file (which only works if your file really was encoded with a BOM). – Edwin Dalorzo Oct 29 '12 at 13:29
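
A minimal sketch of that last suggestion, assuming a UTF-8 input (the class name SkipBom is just for illustration): it peeks at the first character after opening the reader and drops it only when it really is the BOM.

import java.io.*;
import java.nio.charset.Charset;

public class SkipBom {

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("plik.txt"), Charset.forName("UTF-8")))) {
            // peek at the first character and drop it only if it is the
            // BOM (U+FEFF); otherwise rewind so no data is lost
            in.mark(1);
            int first = in.read();
            if (first != 0xFEFF) {
                in.reset();
            }
            String linia;
            while ((linia = in.readLine()) != null) {
                System.out.println(linia);
            }
        }
    }
}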