
So I have this file in which the apostrophes and double quotes are not displayed properly. I tried changing the encoding to UTF-8, but it still didn't help. The problem is that the corruption is not consistent throughout the file, so I cannot simply replace the junk characters with an apostrophe or double quotes. Please help me with this. Basically, I want to read this text in Java and do some further processing for an NLP application. When I read these files in Java with the encoding explicitly set to UTF-8, I still get junk characters, though different from the ones I see in the file.

Here are two sample texts. The first:

It<92>s easy enough, however, to define oneself in whatever way one wants especially when no one in the media challenges you on it. The real test of moral courage is how one acts<97>not just talks<97>in real-life situations. And in the one concrete instance when the Illinois senator was called upon to stand up for justice, he was nowhere to be seen.

Another sample text:

I would have researched everything beforehand and known exactly what kind of tests to expect at each appointment and what the normal range is supposed to be for those tests. It?~@~Ys not that I don?~@~Yt worry that something will happen or that one or more of the tests will come back abnormal. I do. I thought that with all these good appointments I have had in the last few months, I would start feeling less fearful of something going wrong. But my fear level stays about the same.

Lanc
  • Can you post some code on how you read the text to start with? Are you sure that the original text is indeed encoded in UTF-8 when you read it? – fge Feb 13 '14 at 14:18
  • Here is the code: System.setProperty("file.encoding", "UTF-8"); BufferedReader in = new BufferedReader(new FileReader(fileName)); String line; while((line = in.readLine()) != null){ //do sth with text } – Lanc Feb 13 '14 at 14:25
  • According to the answer at http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding?rq=1 this won't work. It looks like you need to specify the encoding separately for each file (rather than setting a system property); see http://stackoverflow.com/q/21735328/217324 – Nathan Hughes Feb 13 '14 at 14:46
  • @NathanHughes Thanks, that helped me partially. So now my problem is that some of my files are not in UTF-8 but in another encoding (maybe a Windows one). Is there an API in Java (or some other Java-based library) that will find out the encoding of the text, so that I could detect the encoding and then read using that encoding? – Lanc Feb 13 '14 at 16:53
  • @Lanc: the reason I posted this as a comment but not as an answer is because I don't know that part. I suspect that Guntram's answer is correct. – Nathan Hughes Feb 13 '14 at 17:15
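A minimal sketch of what the comments above suggest: pass the charset to the reader itself instead of setting the file.encoding system property (FileReader always uses the platform default encoding, so changing file.encoding at runtime has no effect). The file name here is just a placeholder.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReadUtf8File {
        public static void main(String[] args) throws IOException {
            // Give the charset to the reader directly rather than relying on
            // the JVM-wide default encoding.
            try (BufferedReader in = Files.newBufferedReader(
                    Paths.get("sample.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // do something with the text
                }
            }
        }
    }

Note that Files.newBufferedReader throws a MalformedInputException when the bytes are not valid in the given charset, so a wrong encoding at least fails loudly instead of silently producing junk.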

1 Answer


These texts seem to be encoded differently: the first one looks like windows-1252, while the second is probably UTF-8 displayed a bit strangely. Which means there is no single way to read them that works for all of them.

The best you can do is try to detect the file type. For example, if all non-7-bit-ASCII characters come in pairs, with the first byte in the 0xC0-0xFF range and the second in the 0x80-0xBF range, then it's probably UTF-8. If a byte in the 0x80-0xBF range appears immediately after an ASCII byte, then it's NOT UTF-8. Unless you know the text is written in a non-Latin script (Russian, Greek, ...), it's probably safe to assume windows-1252 whenever the content is not well-formed UTF-8.
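A rough sketch of that heuristic, assuming the files are small enough to load into memory: instead of checking byte ranges by hand, attempt a strict UTF-8 decode and fall back to windows-1252 when it fails. The class name and the choice of fallback are mine, not part of the original answer.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class CharsetGuesser {

        // Returns UTF-8 if the bytes decode cleanly as UTF-8, otherwise windows-1252.
        static Charset guessCharset(byte[] bytes) {
            try {
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return StandardCharsets.UTF_8;          // well-formed UTF-8
            } catch (CharacterCodingException e) {
                return Charset.forName("windows-1252"); // assumed fallback
            }
        }
    }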

But this is guesswork, and the only way to ensure you're reading the texts correctly is to determine the encoding of each of them first, maybe sort the texts into different folders depending on encoding, and then use the correct encoding for each file when you read it.
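For illustration, a sketch of that workflow (my own illustration, reusing the guessCharset helper sketched above; the path is a placeholder):

    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ReadWithGuessedCharset {
        public static void main(String[] args) throws Exception {
            Path file = Paths.get("sample.txt");                   // placeholder path
            byte[] bytes = Files.readAllBytes(file);
            Charset charset = CharsetGuesser.guessCharset(bytes);  // helper sketched above
            String text = new String(bytes, charset);              // decode with the guessed charset
            System.out.println(text);                              // hand the text to the NLP pipeline
        }
    }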

Guntram Blohm
  • Can I not convert the windows-1252 encoding to UTF-8? Also, even if it's UTF-8, why is Java not able to read it properly? – Lanc Feb 13 '14 at 14:39
  • I said the second file looks like UTF-8 because every non-ASCII character seems to be encoded in two bytes. There's no way to know for sure what's in the file, however, unless you hexdump it. It may well have been UTF-8 at some point, then read and written by a program that handled it incorrectly. You could try the recode program to convert the files to something your input stream can handle once you know what it is. – Guntram Blohm Feb 13 '14 at 19:38