I am having trouble using Java to identify and replace a specific character that appears in a text file I have. It is a non-printable character, but it seems that Java renders it as – when outputting to the console.

It seems to be this character: https://www.fileformat.info/info/unicode/char/c296/index.htm

Here is what I have done:

  1. I made a copy of the file and deleted everything from it except for the single character that I am struggling with.
  2. Opened the file in UltraEdit. It appeared to be an empty file.
  3. Changed UltraEdit to "hex mode"; now it shows up as two characters: Â– with a hex value of 0xC296 (or "C2" for the "Â" character, and "96" for the "–" character).
  4. I wrote the Java program below in an effort to change this character to something printable, but I have been unsuccessful.

Here is the code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileTester {
    public static void main(String[] args) throws IOException {

        String filePath = "c:/temp/bad-file.txt";
        byte[] data = Files.readAllBytes(Paths.get(filePath));
        System.out.println("Array 0: " + data[0]);
        System.out.println("Array 1: " + data[1]);
        String content = new String(data);
    
        System.out.println(content);
        System.out.println(content.replace("0xC296", "BadCharacter"));
        System.out.println(content.replace("0xec8a96", "BadCharacter"));
        System.out.println(content.replace("\uC296", "BadCharacter"));
    }
}

Here is the output:

Array 0: -62
Array 1: -106
–
–
–
–

UltraEdit's hex mode view of the file (screenshot not included here) shows the same two bytes, C2 96.

Please let me know what I am doing wrong.

Joe

3 Answers


This is a character set issue. Make sure you read the file with the correct character set.

In character set Windows-1252, bytes C2 96 are the characters Â (U+00C2: Latin Capital Letter A with Circumflex) and – (U+2013: En Dash).

In character set ISO 8859-1, byte 96 is undefined. Byte C2 is Â, same as for Windows-1252.

In character set UTF-8, bytes C2 96 are the encoding of the single Unicode code point U+0096 (<Start of Guarded Area> (SPA)).

In character set UTF-16BE, bytes C2 96 are the encoding of the character 슖 (U+C296, a Hangul syllable). In character set UTF-16LE, they would instead decode as the character U+96C2, a Han character.
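
You can verify this yourself with a small sketch like the one below (the class name is just illustrative); it decodes those two bytes with each of the character sets above and prints the resulting code points. Note that Java's ISO-8859-1 decoder maps byte 96 to the control code U+0096, even though no printable character is defined there.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // The two bytes from the question's file.
        byte[] bytes = { (byte) 0xC2, (byte) 0x96 };

        Charset[] charsets = {
                Charset.forName("windows-1252"),
                StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8,
                StandardCharsets.UTF_16BE,
                StandardCharsets.UTF_16LE
        };

        for (Charset cs : charsets) {
            String decoded = new String(bytes, cs);
            System.out.print(cs.name() + " ->");
            // Print each resulting code point in U+XXXX form.
            decoded.codePoints().forEach(cp -> System.out.printf(" U+%04X", cp));
            System.out.println();
        }
    }
}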

The question code uses new String(byte[]), which uses the platform default character set, so it is unclear from the code alone which character set is being used.

Since it's running on Windows, and prints as –, it would appear to be using character set Windows-1252. As such, to replace the pair of characters that resulted from reading the file using that character set, use:

content.replace("\u00C2\u2013", "BadCharacter")

If the Java code had read the file using UTF-8, by calling new String(data, StandardCharsets.UTF_8), the code should be:

content.replace("\u0096", "BadCharacter")
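
Putting it together, a minimal sketch of the corrected program might look like this (same file path as in the question; "BadCharacter" is just an example replacement):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileTester {
    public static void main(String[] args) throws IOException {
        String filePath = "c:/temp/bad-file.txt";
        byte[] data = Files.readAllBytes(Paths.get(filePath));

        // Decode explicitly as UTF-8, so bytes C2 96 become the single code point U+0096.
        String content = new String(data, StandardCharsets.UTF_8);

        // Replace the non-printable control character U+0096 with something visible.
        System.out.println(content.replace("\u0096", "BadCharacter"));
    }
}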

FYI: UltraEdit likely opened the file using UTF-8, which is why it appeared to be an empty file. See "Unicode text and Unicode files in UltraEdit" to learn more about how UltraEdit handles Unicode files.

Andreas

  I am having trouble using Java to identify and replace a specific character that appears in a text file I have. It is a non-printable character, but it seems that Java renders it as – when outputting to the console.

  It seems to be this character: https://www.fileformat.info/info/unicode/char/c296/index.htm

I'll translate:

Joe said: "I have a square. It is a circle".

You've made conflicting statements. Is it a non-printable character, or is it 슖, which is perfectly printable (see? I just printed it), or is it something completely different that shows up as 0xC2 96 in the file, where you've immediately jumped to the conclusion that this must mean it is 슖 because that has Unicode number 0xC296?

Before you call this a 'bad file', it's just an encoded file and you simply need to apply the proper encoding to it.

Whenever bytes turn into characters or vice versa, charset conversion is ALWAYS applied. You can't not do it. Thus, in new String(bytes), yup, charset conversion is applied. Which one? Well, the 'platform default', which is just a funny way of saying 'the wrong answer'. You never want platform default. Don't ever call new String(bytes); it's a dumb method you should never use.
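
In practice that just means always passing a charset explicitly, for example like this (UTF-8 below is only a stand-in for whatever encoding the file actually uses):

byte[] bytes = { (byte) 0xC2, (byte) 0x96 };

// Decodes with the platform default charset, whatever that happens to be on this machine.
String platformDefault = new String(bytes);

// States the charset explicitly, so the result is the same on every machine.
String explicit = new String(bytes, java.nio.charset.StandardCharsets.UTF_8);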

Unfortunately, plain text files just do not have an encoding tagged along with the data. You can't read .txt files unless you already know what encoding you have. If you don't know, you can't read it, or, if you don't know but you have a pretty good idea of what's in the file, you can go into Sherlock Holmes mode and attempt to figure it out.

You've told Java that it is encoded with 'platform default', whatever that might be (looks like ISO-8859-1 or Win-1252), and you get garbage out, but that's because you specified the wrong encoding, not because 'Java is bad' or 'the file is bad'. Just specify the right encoding and all is right as rain.

Open the file with a text editor of some renown (such as Sublime Text, CotEditor, Notepad++, etc.), and play around with the encoding setting until the text makes sense.

You must use your human brain (this is pretty hard for a computer to do!) and look at the file and make an educated guess. For example, if I see Mç~ller in a file where it seems logical that these are last names of European origin, that probably said Müller, so now I can either try to backsolve (look up the hex sequence, and toss that sequence plus ü into a web search engine, and that'll usually tell you what to do), or just keep picking encodings in my editor until I see Müller appear, and now I know.
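
A sketch of how that kind of mangling arises (this particular example assumes UTF-8 bytes misread as windows-1252, which is just one of many possible mix-ups, not necessarily the one behind Mç~ller above):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // Encode "Müller" as UTF-8, then (wrongly) decode those bytes as windows-1252.
        byte[] utf8Bytes = "Müller".getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8Bytes, Charset.forName("windows-1252"));

        System.out.println(misread); // prints MÃ¼ller
    }
}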

Thus, if you've done that, and you have truly determined that 슖 makes sense, okay: backsolving for that, the only encoding that makes sense here is UTF-16BE. So toss that into your editor, or use new String(thoseBytes, "UTF-16BE"), and see if the stuff makes sense now.
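
A quick sketch of that check (the class name is illustrative and the path is reused from the question):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DecodeCheck {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("c:/temp/bad-file.txt"));

        // Decode with an explicit guess at the encoding and eyeball the result.
        String content = new String(bytes, StandardCharsets.UTF_16BE);
        System.out.println(content);
    }
}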

In no case should you be doing what these other answers are suggesting, which is to read the file with the wrong encoding and then try to clean up the epic mess that results. That's a bit like that Mr Bean sketch about how to paint a house:

  1. Get a can of paint.
  2. Get a stick of dynamite.
  3. Put the can in the middle of the room.
  4. Light the dynamite.
  5. Put the dynamite in the can and all done!

... and then clean up the mess and fix all the areas the explosion didn't catch and put out the fire.

Or maybe just, ya know, skip the dynamite and just buy a paintroller instead.

Same here. Just decode those bytes the right way in the first place instead of scraping globs of paint and dynamite wrapper off the walls.

rzwitserloot
private static String cleanTextContent(String text) {
    // Strips off all non-ASCII characters.
    text = text.replaceAll("[^\\x00-\\x7F]", "");

    // Erases the ASCII control characters, except \r, \n and \t.
    text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // Removes all remaining non-printable (Unicode category C) characters.
    text = text.replaceAll("\\p{C}", "");

    return text.trim();
}

You can remove or replace non-printable characters by using regex, as in the method above.
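
A minimal sketch of how this might be called from the question's program, assuming cleanTextContent is defined in the same class and that the file decodes as UTF-8 (both are assumptions):

public static void main(String[] args) throws java.io.IOException {
    // Read the question's file, decode it explicitly, then strip anything non-printable.
    byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("c:/temp/bad-file.txt"));
    String content = new String(data, java.nio.charset.StandardCharsets.UTF_8);
    System.out.println(cleanTextContent(content));
}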

noah1400