I am reading plain text data from several files and feeding this data into a natural language processing (NLP) module. The NLP module can't handle all Unicode characters, so I am using the following code to decode the text as UTF-8:
byte[] encoded = Files.readAllBytes(path);
return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(encoded)).toString();
where path is the location of the text file I want.
However, the NLP module throws an error because it keeps encountering � (U+FFFD, decimal 65533). From the Javadoc, I see that:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
Then why does it retain the '�' character?
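For reference, here is a minimal sketch of what I think is going on. The decoder's replacement for bad input appears to be U+FFFD itself, so the character is inserted by the decode step rather than retained from the file. The class name DecodeDemo, the helper names, and the ISO-8859-1 sample bytes are my own assumptions for illustration; I have not confirmed what encoding my real files use. The strict variant uses CharsetDecoder with CodingErrorAction.REPORT so that bad input raises an exception instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Same call as in the question: malformed bytes are replaced, not rejected.
    static String decodeLenient(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Strict variant: throws CharacterCodingException on bytes that are not valid UTF-8.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: "café" stored as ISO-8859-1, so the final byte 0xE9 is not valid UTF-8.
        byte[] notUtf8 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String lenient = decodeLenient(notUtf8);
        System.out.println("lenient result contains U+FFFD: " + (lenient.indexOf('\uFFFD') >= 0));

        try {
            decodeStrict(notUtf8);
            System.out.println("strict decode succeeded");
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected the input: " + e);
        }
    }
}
```

If the strict variant throws on my real files, that would suggest they are not actually UTF-8 and need to be read with their true charset instead.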