
I am reading plain text data from several files and feeding it into a natural language processing (NLP) module. The NLP module can't handle all Unicode characters, so I am using the following code to convert the text to UTF-8 encoding:

byte[] encoded = Files.readAllBytes(path);
return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(encoded)).toString();

where `path` is the location of the text file I want to read.

However, the NLP module throws an error because it keeps encountering � (U+FFFD, decimal: 65533). From the javadoc, I see that

> This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

Why, then, does the resulting string still contain the '�' character?
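
A minimal sketch of a stricter decode (the file name `article.txt` is only a placeholder): with `CodingErrorAction.REPORT`, malformed byte sequences raise a `CharacterCodingException` instead of being silently replaced, which makes it visible that the '�' characters are produced by the decoder itself and, being valid characters, survive into the resulting string:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StrictDecode {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("article.txt"); // placeholder input file

        byte[] encoded = Files.readAllBytes(path);
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD,
            // so invalid input is caught here rather than inside the NLP module.
            String text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(encoded))
                    .toString();
            System.out.println(text);
        } catch (CharacterCodingException e) {
            // The bytes are not valid UTF-8: the file is in some other encoding.
            System.err.println("Not valid UTF-8: " + path);
        }
    }
}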

Chthonic Project
  • You're not converting *to* UTF-8. You're decoding the data as if it *is* UTF-8. That's not the same thing at all, and it's not clear why you'd expect that converting the data *to* UTF-8 would help a module that can't handle all of Unicode... What encoding are your text files actually in? – Jon Skeet Feb 14 '14 at 18:02
  • I haven't checked that, actually. I am working on a dataset scraped from several news websites, so there needs to be a lot of preprocessing done before I can feed this data to the NLP module. Thank you for correcting my mistake. Is [this answer on SO](http://stackoverflow.com/questions/88838/how-to-convert-strings-to-and-from-utf8-byte-arrays-in-java) what I should be doing? – Chthonic Project Feb 14 '14 at 18:05
  • Not really. You need to work out the encoding (which would hopefully have been in the HTTP headers of the pages you fetched) of the files... at that point you should be able to load them into Java performing appropriate decoding. Whether you *then* need to perform any trimming to avoid characters that your NLP module can't handle is a different question - but get the text right first. – Jon Skeet Feb 14 '14 at 18:26
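
Following the advice in the comments, a minimal sketch: decode with the file's actual charset (ISO-8859-1 below is only an assumed example; the real encoding should come from, e.g., the HTTP Content-Type headers of the scraped pages), then strip characters the NLP module can't handle (the printable-ASCII filter is likewise only a placeholder):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PreprocessForNlp {

    // Decode with the file's real charset, then drop characters the
    // NLP module can't handle. Both the charset and the "keep" rule
    // below are assumptions to adapt to the actual data.
    static String load(Path path, Charset actualCharset) throws IOException {
        String text = new String(Files.readAllBytes(path), actualCharset);
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Placeholder filter: keep printable ASCII plus whitespace.
            if (c < 128 && (!Character.isISOControl(c) || Character.isWhitespace(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 is only an example; check what the source sites used.
        System.out.println(load(Paths.get("article.txt"), StandardCharsets.ISO_8859_1));
    }
}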

0 Answers