The code that wrote your CSV file is broken: it triple-encoded the text as UTF-8.
In UTF-8, ASCII characters (code points 0–127) are represented as single bytes that are identical to their ASCII values, so re-encoding them any number of times never changes them. That's why only £ is affected.

£ requires two bytes in UTF-8: 0xc2, 0xa3. If the code that wrote your CSV file had used UTF-8 properly, the character would appear as those two bytes in the file.
But, apparently, some code somewhere read the file using a one-byte charset (like ISO-8859-1), causing each individual byte to be treated as its own character. It then used UTF-8 to encode those individual characters: it took the { 0xc2, 0xa3 } bytes and encoded each of them in UTF-8, which produced these bytes: 0xc3, 0x82, 0xc2, 0xa3. (Specifically: the U+00C2 character is represented in UTF-8 as 0xc3 0x82, and the U+00A3 character is represented in UTF-8 as 0xc2 0xa3.)
Then, sometime after that, the same thing was done again. Those four bytes were read using a one-byte charset, each byte was treated as its own character, and each of those four characters was encoded in UTF-8, which resulted in eight bytes: 0xc3, 0x83, 0xc2, 0x82, 0xc3, 0x82, 0xc2, 0xa3. (Not every character is converted to two bytes when encoded as UTF-8; it just happens that all of these characters are.)
This is why, when you read the file using the ISO-8859-1 charset, you get one character for each byte:

Ã  ƒ  Â  ‚  Ã  ‚  Â  £
c3 83 c2 82 c3 82 c2 a3
(Technically, ‚ is actually U+201A, "Single Low-9 Quotation Mark," but many one-byte-per-character Windows fonts have historically had that character at position 0x82.)
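If you want to watch the damage happen, here is a minimal, self-contained sketch (the class and variable names are just for illustration) that performs the two mis-encoding rounds described above and prints each stage as hex:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "£";                                    // U+00A3
        byte[] once = original.getBytes(StandardCharsets.UTF_8);  // c2 a3

        // Mistake #1: read the bytes with a one-byte charset, re-encode as UTF-8.
        String misread = new String(once, StandardCharsets.ISO_8859_1);
        byte[] twice = misread.getBytes(StandardCharsets.UTF_8);  // c3 82 c2 a3

        // Mistake #2: the same thing done again.
        misread = new String(twice, StandardCharsets.ISO_8859_1);
        byte[] thrice = misread.getBytes(StandardCharsets.UTF_8); // c3 83 c2 82 c3 82 c2 a3

        print(once);
        print(twice);
        print(thrice);
    }

    private static void print(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim());
    }
}

The last line of output is exactly the eight-byte sequence shown above.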
So, now that we know how your file got that way, what do you do about it?
First, stop making it worse. If you have control over the code that’s writing the file, make sure that code explicitly specifies a charset for both reading and writing. UTF-8 is almost always the best choice, at least for any file using predominantly western characters.
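For example, here is a minimal sketch (the file name and the two lines written are placeholders) of naming the charset explicitly on both the writing and the reading side with java.nio:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExplicitCharset {
    public static void main(String[] args) throws IOException {
        Path csv = Paths.get("prices.csv");   // placeholder file name

        // Writing: name the charset instead of relying on the platform default.
        try (BufferedWriter out = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
            out.write("amount,currency\n");
            out.write("9.99,£\n");
        }

        // Reading: decode with the same charset the file was written with.
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}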
Second, how does one fix the file? There is no way to automatically detect this mis-encoding, I’m afraid, but at least in the case of this one file, you can triple-decode it.
If the file is not very large, you can just read it all into memory:
byte[] bytes = Files.readAllBytes(Paths.get(csvDirectory, filename));

// First decoding: £ is represented as four characters
String content = new String(bytes, StandardCharsets.UTF_8);
// Every character is in the 0x00-0xFF range, so casting each one back
// to a byte recovers the bytes of the previous encoding round.
bytes = new byte[content.length()];
for (int i = content.length() - 1; i >= 0; i--) {
    bytes[i] = (byte) content.charAt(i);
}

// Second decoding: £ is represented as two characters
content = new String(bytes, StandardCharsets.UTF_8);
bytes = new byte[content.length()];
for (int i = content.length() - 1; i >= 0; i--) {
    bytes[i] = (byte) content.charAt(i);
}

// Third decoding: £ is represented as one character
content = new String(bytes, StandardCharsets.UTF_8);

BufferedReader br = new BufferedReader(new StringReader(content));
// ...
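Each cast-and-copy loop above is equivalent to calling getBytes(StandardCharsets.ISO_8859_1), as long as every character is in the 0x00–0xFF range, which is exactly the case for correctly mis-encoded text. A sketch of one undo round written that way, with a helper name of my own choosing:

// Undoes one round of the mis-encoding: reinterpret each char as the raw
// byte it came from (ISO-8859-1 maps 0x00-0xFF one-to-one), then decode
// those bytes as UTF-8.
static String undoOneRound(String mojibake) {
    byte[] raw = mojibake.getBytes(StandardCharsets.ISO_8859_1);
    return new String(raw, StandardCharsets.UTF_8);
}

// content above has already been decoded once from the file's bytes,
// so two more rounds finish the job:
// String fixed = undoOneRound(undoOneRound(content));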
If it’s a large file, you will want to read each line as bytes:
try (InputStream in = new BufferedInputStream(
        Files.newInputStream(Paths.get(csvDirectory, filename)))) {
    ByteBuffer lineBuffer = ByteBuffer.allocate(64 * 1024);
    int b = 0;
    while (b >= 0) {
        // Collect raw bytes until end of line or end of file.
        lineBuffer.clear();
        for (b = in.read();
             b >= 0 && b != '\n' && b != '\r';
             b = in.read()) {
            lineBuffer.put((byte) b);
        }
        // Treat \r\n as a single line terminator.
        if (b == '\r') {
            in.mark(1);
            if (in.read() != '\n') {
                in.reset();
            }
        }
        lineBuffer.flip();
        byte[] bytes = new byte[lineBuffer.limit()];
        lineBuffer.get(bytes);

        // First decoding: £ is represented as four characters
        String parsedLine = new String(bytes, StandardCharsets.UTF_8);
        bytes = new byte[parsedLine.length()];
        for (int i = parsedLine.length() - 1; i >= 0; i--) {
            bytes[i] = (byte) parsedLine.charAt(i);
        }

        // Second decoding: £ is represented as two characters
        parsedLine = new String(bytes, StandardCharsets.UTF_8);
        bytes = new byte[parsedLine.length()];
        for (int i = parsedLine.length() - 1; i >= 0; i--) {
            bytes[i] = (byte) parsedLine.charAt(i);
        }

        // Third decoding: £ is represented as one character
        parsedLine = new String(bytes, StandardCharsets.UTF_8);
        // ...
    }
}
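An alternative sketch for large files, assuming the line terminators are ordinary \n or \r\n (they are ASCII, so the mis-encoding never touched them): let a BufferedReader split the lines while decoding with ISO-8859-1, which yields exactly one char per byte, and then apply the three UTF-8 decodings per line:

try (BufferedReader reader = Files.newBufferedReader(
        Paths.get(csvDirectory, filename), StandardCharsets.ISO_8859_1)) {
    String line;
    while ((line = reader.readLine()) != null) {
        // Each char is one raw byte of the file; three UTF-8 decodes
        // undo the triple encoding.
        byte[] raw = line.getBytes(StandardCharsets.ISO_8859_1);
        String once = new String(raw, StandardCharsets.UTF_8);
        byte[] towardsTwice = once.getBytes(StandardCharsets.ISO_8859_1);
        String twice = new String(towardsTwice, StandardCharsets.UTF_8);
        byte[] towardsThrice = twice.getBytes(StandardCharsets.ISO_8859_1);
        String fixedLine = new String(towardsThrice, StandardCharsets.UTF_8);
        // ... parse fixedLine ...
    }
}

This is a bit simpler than managing the ByteBuffer by hand, at the cost of a few extra intermediate strings per line.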