I am having trouble using Java to identify and replace a specific character that appears in a text file I have. It is a non-printable character, but it seems that Java renders it as – when outputting to the console.
It seems to be this character: https://www.fileformat.info/info/unicode/char/c296/index.htm
I'll translate:
Joe said: "I have a square. It is a circle."
You've made conflicting statements. Is it a non-printable character? Is it 슖, which is perfectly printable (see? I just printed it)? Or is it something completely different that shows up as the bytes 0xC2 0x96 in the file, and you've jumped to the conclusion that it must be 슖 because that character has Unicode code point 0xC296?
Before you call this a 'bad file': it isn't. It's just an encoded file, and you simply need to apply the proper encoding to it.
Whenever bytes turn into characters or vice versa, charset conversion is ALWAYS applied. You can't not do it. Thus, in `new String(bytes)`, yup, charset conversion is applied. Which one? Well, the 'platform default', which is just a funny way of saying 'the wrong answer'. You never want the platform default. Don't ever call `new String(bytes)`; it's a dumb method you should never use.
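For concreteness, a minimal sketch of the difference, using the 0xC2 0x96 byte pair from the question; the charsets shown are just examples:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PlatformDefaultPitfall {
    public static void main(String[] args) {
        // The two bytes mentioned in the question.
        byte[] bytes = { (byte) 0xC2, (byte) 0x96 };

        // Don't: decodes with the platform default, so this prints different
        // things on different machines (which is how mystery characters appear).
        System.out.println("default : " + new String(bytes));

        // Do: name the charset, so the result is the same everywhere.
        System.out.println("UTF-8   : " + new String(bytes, StandardCharsets.UTF_8));         // U+0096, a control character
        System.out.println("cp1252  : " + new String(bytes, Charset.forName("windows-1252"))); // "Â–"
    }
}
```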
Unfortunately, plain text files just do not have an encoding tagged along with the data. You can't read .txt files unless you already know what encoding you have. If you don't know, you can't read it, or, if you don't know but you have a pretty good idea of what's in the file, you can go into Sherlock Holmes mode and attempt to figure it out.
You've told Java that it is encoded with 'platform default', whatever that might be (it looks like ISO-8859-1 or Win-1252 here), and you get garbage out. But that's because you specified the wrong encoding, not because 'Java is bad' or 'the file is bad'. Just specify the right encoding and all is right as rain.
Open the file with a text editor of some renown (such as Sublime Text, CotEditor, Notepad++, etc.) and play around with the encoding setting until the text makes sense.
You must use your human brain (this is pretty hard for a computer to do!) and look at the file and make an educated guess. For example, if I see Mç~ller in a file where it seems logical that these are last names of European origin, that probably said Müller. So now I can either try to backsolve (look up the hex sequence, toss that sequence + ü into a web search engine, and that'll usually tell you what to do), or just keep picking encodings in my editor until I see Müller appear, and now I know.
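If you'd rather do that guessing from Java instead of from an editor, a rough sketch along these lines works: decode the same raw bytes with a few candidate charsets and eyeball the output. The file path and the candidate list here are placeholders, not anything from the question:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class GuessEncoding {
    public static void main(String[] args) throws IOException {
        // Placeholder path; point it at your own file.
        byte[] raw = Files.readAllBytes(Path.of("input.txt"));

        // A handful of common suspects; add whatever fits your situation.
        List<String> candidates = List.of("UTF-8", "windows-1252", "ISO-8859-1", "UTF-16BE", "UTF-16LE");

        for (String name : candidates) {
            String decoded = new String(raw, Charset.forName(name));
            // Print a short preview and decide with your human brain which one reads right.
            System.out.println(name + " -> " + decoded.substring(0, Math.min(80, decoded.length())));
        }
    }
}
```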
Thus, if you've done that and have truly determined that 슖 makes sense, okay: backsolving from that, the only encoding that fits is UTF-16BE (the bytes 0xC2 0x96 decode there to the single character U+C296). So toss that into your editor, or use `new String(thoseBytes, "UTF-16BE")`, and see if the stuff makes sense now.
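Something like this, as a sketch (`input.txt` is a placeholder path, and whether UTF-16BE is actually right depends on your detective work above):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DecodeUtf16be {
    public static void main(String[] args) throws Exception {
        // Sanity check on just the two bytes from the question:
        byte[] pair = { (byte) 0xC2, (byte) 0x96 };
        System.out.println(new String(pair, StandardCharsets.UTF_16BE)); // prints 슖

        // If the whole file really is UTF-16BE, read it that way.
        String text = Files.readString(Path.of("input.txt"), StandardCharsets.UTF_16BE);
        System.out.println(text);
    }
}
```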
In no case should you do what the other answers are suggesting, which is to read the file with the wrong encoding and then try to clean up the epic mess that results. That's a bit like that Mr. Bean sketch for training to paint a house:
- Get a can of paint.
- Get a stick of dynamite.
- Put the can in the middle of the room.
- Light the dynamite.
- Put the dynamite in the can and all done!
... and then clean up the mess and fix all the areas the explosion didn't catch and put out the fire.
Or maybe just, ya know, skip the dynamite and buy a paint roller instead.
Same here. Just decode those bytes the right way in the first place instead of scraping globs of paint and dynamite wrapper off the walls.
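Putting it together as a sketch: this assumes the file turned out to be UTF-8 and the offending character is U+0096 (which is what 0xC2 0x96 decodes to in UTF-8); the file names and the replacement text are purely illustrative, so swap in whatever your own detective work found:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FixFile {
    public static void main(String[] args) throws Exception {
        Path in = Path.of("input.txt");
        Charset cs = StandardCharsets.UTF_8; // the encoding you actually determined

        String text = Files.readString(in, cs);        // decode correctly, once
        String cleaned = text.replace("\u0096", "-");  // now the replace is trivial
        Files.writeString(Path.of("output.txt"), cleaned, cs); // encode explicitly too
    }
}
```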