2

I have written code to read data from a CSV file. However, it cannot handle special characters such as £.

For example, "My Base Cost (K£)" is being read as "My Base Cost (KÃƒÂ‚Ã‚Â£)".

What can I do to correct this?

public void parseCSVFile(String filename){

     try {
            br = new BufferedReader(new FileReader(csvDirectory + filename));

            while ((parsedLines = br.readLine()) != null) {

                String[] parsedData = parsedLines.split(csvSplitByComma);

                entireFeed.add(parsedData[0]);
                entireFeed.add(parsedData[1]);

                System.out.println(parsedData[0]);
                System.out.println(parsedData[1]);

                it = entireFeed.iterator();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
}
NSC
  • Possible duplicate of http://stackoverflow.com/questions/9281629/read-special-characters-in-java-with-bufferedreader – Niranjan Kumar Nov 16 '16 at 14:10
  • @NiranjanKumar I tried the following and it still did not work. I get back "My Base Cost (K£)": BufferedReader br = new BufferedReader( new InputStreamReader(new FileInputStream(file),"ISO-8859-1")); – NSC Nov 16 '16 at 14:16
  • Possible duplicate of [Read/write .txt file with special characters](http://stackoverflow.com/questions/4597749/read-write-txt-file-with-special-characters) – Aleksandr Erokhin Nov 16 '16 at 14:21
  • @AlexErohin I have tried this and still get the original error. – NSC Nov 16 '16 at 14:27

2 Answers

5

The code that wrote your CSV is broken: it encoded the text in UTF-8 three times over.

In UTF-8, ASCII characters (codepoints 0–127) are represented as single bytes; they need no encoding. That’s why only £ is affected.

£ requires two bytes in UTF-8. Those bytes are: 0xc2, 0xa3. If the code that wrote your CSV file had used UTF-8 properly, the character would appear as those two bytes in the file.

But, apparently, some code somewhere read the file using a one-byte charset (like ISO-8859-1), causing each individual byte to be treated like its own character. It then used UTF-8 to encode those individual characters. Meaning, it took the { 0xc2, 0xa3 } bytes and encoded each of them in UTF-8. That in turn produced these bytes: 0xc3, 0x82, 0xc2, 0xa3. (Specifically: The U+00C2 character is represented in UTF-8 as 0xc3 0x82, and the U+00A3 character is represented in UTF-8 as 0xc2 0xa3.)

Then, sometime after that, the same thing was done again. Those four bytes were read using a one-byte charset, each byte was treated as its own character, and each of those four characters was encoded in UTF-8, which resulted in eight bytes: 0xc3, 0x83, 0xc2, 0x82, 0xc3, 0x82, 0xc2, 0xa3. (Not every character is converted to two bytes when encoded as UTF-8; it just happens that all of these characters are.)
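The byte sequences above can be reproduced with a short sketch. The class and method names here (MojibakeDemo, misEncodeOnce, hex) are made up for illustration, and it assumes the one-byte charset involved was ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // One round of damage: treat each UTF-8 byte as its own ISO-8859-1
    // character, then encode those characters as UTF-8 again.
    static byte[] misEncodeOnce(byte[] utf8Bytes) {
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return wrong.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated lowercase hex, e.g. "c2 a3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign; the escape keeps this source file pure ASCII.
        byte[] correct = "\u00a3".getBytes(StandardCharsets.UTF_8);
        byte[] twice = misEncodeOnce(correct);
        byte[] thrice = misEncodeOnce(twice);

        System.out.println(hex(correct)); // c2 a3
        System.out.println(hex(twice));   // c3 82 c2 a3
        System.out.println(hex(thrice));  // c3 83 c2 82 c3 82 c2 a3
    }
}
```

Each call to misEncodeOnce doubles the number of bytes used for £, which matches the two, four and eight bytes described above.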

This is why, when you read the file using the ISO-8859-1 charset, you get one character for each byte:

Ã   ƒ   Â   ‚   Ã   ‚   Â   £
c3  83  c2  82  c3  82  c2  a3

(Technically, ‚ is actually U+201A "Single Low-9 Quotation Mark," but many one-byte-per-character Windows fonts have historically had that character at position 0x82.)
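To see the difference for byte 0x82, here is a small sketch that decodes it with two one-byte charsets. The class and method names are hypothetical, and it assumes your JDK ships the windows-1252 charset (common desktop and server JDKs do, but only a handful of charsets such as UTF-8 and ISO-8859-1 are guaranteed):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteEightyTwo {
    // Decodes a single byte with the given charset and returns its code point.
    static int decodeSingleByte(int value, Charset charset) {
        String decoded = new String(new byte[] { (byte) value }, charset);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        Charset windows1252 = Charset.forName("windows-1252");

        int latin1 = decodeSingleByte(0x82, StandardCharsets.ISO_8859_1);
        int cp1252 = decodeSingleByte(0x82, windows1252);

        System.out.println(String.format("U+%04X", latin1)); // U+0082
        System.out.println(String.format("U+%04X", cp1252)); // U+201A
    }
}
```

In ISO-8859-1 the byte maps to the control character U+0082, while windows-1252 maps it to the visible low quotation mark.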

So, now that we know how your file got that way, what do you do about it?

First, stop making it worse. If you have control over the code that’s writing the file, make sure that code explicitly specifies a charset for both reading and writing. UTF-8 is almost always the best choice, at least for any file using predominantly western characters.
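As a minimal sketch of what "explicitly specifies a charset" can look like, the following writes and reads a line with UTF-8 named on both sides. The class name, method names and sample line are invented for illustration; your real writer code will differ:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetIo {
    // Writes one line, naming the charset instead of relying on the platform default.
    static void writeLine(Path file, String line) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(line);
            writer.newLine();
        }
    }

    // Reads the first line back with the same explicit charset.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".csv");
        try {
            // \u00a3 is the pound sign.
            writeLine(file, "My Base Cost (K\u00a3),42");
            System.out.println(readFirstLine(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Whether the console shows £ correctly depends on the console's own encoding, which is a separate matter from what is stored in the file.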

Second, how does one fix the file? There is no way to automatically detect this mis-encoding, I’m afraid, but at least in the case of this one file, you can triple-decode it.

If the file is not very large, you can just read it all into memory:

byte[] bytes = Files.readAllBytes(Paths.get(csvDirectory, filename));
// First decoding: £ is represented as four characters
String content = new String(bytes, "UTF-8");

bytes = new byte[content.length()];
for (int i = content.length() - 1; i >= 0; i--) {
    bytes[i] = (byte) content.charAt(i);
}
// Second decoding: £ is represented as two characters
content = new String(bytes, "UTF-8");

bytes = new byte[content.length()];
for (int i = content.length() - 1; i >= 0; i--) {
    bytes[i] = (byte) content.charAt(i);
}
// Third decoding: £ is represented as one character
content = new String(bytes, "UTF-8");

br = new BufferedReader(new StringReader(content));

// ...
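The repeated "narrow each char to a byte, then decode" step above could also be factored into a helper. This is only a sketch with hypothetical names (TripleDecode, undoOneRound, repair). It swaps the manual cast loop for getBytes with ISO-8859-1, which behaves the same for characters up to U+00FF but replaces anything higher with '?' instead of truncating it:

```java
import java.nio.charset.StandardCharsets;

class TripleDecode {
    // Undoes one round of mis-encoding: map each char (expected to be
    // U+0000..U+00FF) back to a single byte, then decode those bytes as UTF-8.
    static String undoOneRound(String text) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Repairs bytes that were UTF-8 encoded three times over.
    static String repair(byte[] fileBytes) {
        String text = new String(fileBytes, StandardCharsets.UTF_8); // first decoding
        text = undoOneRound(text);                                   // second decoding
        return undoOneRound(text);                                   // third decoding
    }

    public static void main(String[] args) {
        // "(K" + the eight bytes the file holds for the pound sign + ")"
        byte[] damaged = {
            '(', 'K',
            (byte) 0xc3, (byte) 0x83, (byte) 0xc2, (byte) 0x82,
            (byte) 0xc3, (byte) 0x82, (byte) 0xc2, (byte) 0xa3,
            ')'
        };
        String repaired = repair(damaged);
        System.out.println(repaired.equals("(K\u00a3)")); // true
    }
}
```

Only run this on input you know went through exactly this damage. Correctly encoded text containing non-ASCII characters would be corrupted by it.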

If it’s a large file, you will want to read each line as bytes:

try (InputStream in = new BufferedInputStream(
    Files.newInputStream(Paths.get(csvDirectory, filename)))) {

    ByteBuffer lineBuffer = ByteBuffer.allocate(64 * 1024);

    int b = 0;
    while (b >= 0) {
        lineBuffer.clear();

        for (b = in.read();
             b >= 0 && b != '\n' && b != '\r';
             b = in.read()) {

            lineBuffer.put((byte) b);
        }

        if (b == '\r') {
            in.mark(1);
            if (in.read() != '\n') {
                in.reset();
            }
        }

        lineBuffer.flip();
        byte[] bytes = new byte[lineBuffer.limit()];
        lineBuffer.get(bytes);

        // First decoding: £ is represented as four characters
        String parsedLine = new String(bytes, "UTF-8");

        bytes = new byte[parsedLine.length()];
        for (int i = parsedLine.length() - 1; i >= 0; i--) {
            bytes[i] = (byte) parsedLine.charAt(i);
        }
        // Second decoding: £ is represented as two characters
        parsedLine = new String(bytes, "UTF-8");

        bytes = new byte[parsedLine.length()];
        for (int i = parsedLine.length() - 1; i >= 0; i--) {
            bytes[i] = (byte) parsedLine.charAt(i);
        }
        // Third decoding: £ is represented as one character
        parsedLine = new String(bytes, "UTF-8");

        // ...
    }
}
VGR
  • thanks for the explanation, it makes sense as to where I was going wrong. I have rectified my code and it is now working as expected. – NSC Nov 17 '16 at 11:34
3

This looks like an encoding problem. Find out which charset your file is encoded in. Assuming the encoding is UTF-8, you can do something like this:

new BufferedReader(new InputStreamReader(new FileInputStream("my/path/to/File"), "UTF-8"));

This should solve your problem.

Michael Gantman