Character corruption going from BufferedReader to BufferedWriter in Java

Question

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.

I encounter a known problem when text contains a left facing quotation mark. Text such as

mutations to particular “hotspot” regions

becomes

 mutations to particular “hotspot�? regions

I have isolated the problem by writing a simple text copy method:

public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            //Parsing would happen
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}

Can anybody offer some advice as how to correct this problem?

★My solution

InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
while ((r = reader.read()) != -1)
{
    if (r < 126)
    {
        output.write(r);
    }
    else
    {
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();

is it just me or is the "buffer" Reader obsolete in the last snippet? — ılǝ, May 29 '13 at 01:40

score 6 · Accepted Answer · edited May 23 '17 at 12:01

The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).

Try the following to generate a file with UTF-8 encoding:

BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
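For reference, here is a minimal sketch of the whole copy loop with an explicit charset on both the reading and the writing side. The class name EncodingCopy and its copy method are hypothetical, and UTF-8 for the input is an assumption about the asker's file that the question does not confirm.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
class EncodingCopy {

    // Copies a text file line by line, decoding with inCs and encoding with outCs.
    static void copy(File source, File target, Charset inCs, Charset outCs) throws IOException {
        // try-with-resources closes (and therefore flushes) both streams.
        try (BufferedReader input = new BufferedReader(
                 new InputStreamReader(new FileInputStream(source), inCs));
             BufferedWriter output = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(target), outCs))) {
            String line;
            while ((line = input.readLine()) != null) {
                // Parsing would happen here.
                output.write(line);
                output.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: both files are meant to be UTF-8; paths come from the command line.
        copy(new File(args[0]), new File(args[1]),
             StandardCharsets.UTF_8, StandardCharsets.UTF_8);
    }
}
```

Passing Charset objects rather than names such as "UTF8" avoids the checked UnsupportedEncodingException and any misspelled charset name.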

Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream

answered Aug 24 '10 at 17:54

Thierry Roy

And as far as I know there isn't really an automated way to obtain the encoding of a text file. – extraneon Aug 24 '10 at 17:57
"UTF8" and "UTF-16" don't seem to work even though the encoding is explicitly stated in the HTML... Does anybody know how to look up an encoding by going from a known character in a file? – Mikhail Aug 24 '10 at 19:46
I tried US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE and UTF-16, and they don't work... – Mikhail Aug 24 '10 at 21:14
The character's decimal value is "8221", so it should be Unicode, right? – Mikhail Aug 24 '10 at 21:21
The problem was solved by changing to UTF-8 AND parsing the entire file and replacing all special characters above 126 with the "&#xx;" format. – Mikhail Aug 25 '10 at 02:19
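The replacement of characters above 126 with numeric character references could be sketched roughly as below. The class EntityEscaper and the method escapeNonAscii are hypothetical names. Unlike a loop over Reader.read(), which returns UTF-16 code units, this version walks code points, so a character outside the Basic Multilingual Plane becomes one reference rather than two surrogate halves.

```java
// Hypothetical helper class, not part of the original thread.
class EntityEscaper {

    // Replaces every code point above 126 with a decimal numeric character reference.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp <= 126) {
                sb.appendCodePoint(cp);                   // plain ASCII is kept as-is
            } else {
                sb.append("&#").append(cp).append(';');   // e.g. U+201D becomes &#8221;
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: mutations to particular &#8220;hotspot&#8221; regions
        System.out.println(escapeNonAscii("mutations to particular \u201Chotspot\u201D regions"));
    }
}
```

Because the result is pure ASCII, it survives being written with almost any default encoding, which is presumably why this workaround helped.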

score 0 · Answer 2 · answered Aug 24 '10 at 18:00

In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to create your FileReader with a bit of extra work. From the docs:

Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

score 0 · Answer 3 · answered Aug 24 '10 at 18:00

The Javadoc for FileReader says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:

FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
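To see why the wrong charset garbles the text, here is a small sketch that encodes a right double quotation mark (U+201D, decimal 8221) as UTF-8 and then decodes the bytes as ISO-8859-1. The class MojibakeDemo and the method misdecode are hypothetical, and ISO-8859-1 only stands in for whatever the platform default happens to be on the asker's machine.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class MojibakeDemo {

    // Encodes text as UTF-8, then decodes those bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+201D is three bytes in UTF-8 (E2 80 9D); ISO-8859-1 maps each byte to one char.
        for (char c : misdecode("\u201D").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // prints:
        // U+00E2
        // U+0080
        // U+009D
    }
}
```

One character has turned into three, which is the kind of damage shown in the question. Decoding with the charset the file was actually written in avoids it.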
