
I'm working with some files that might be either UTF-8 or ANSI (Cp1252 specifically), and I need to load them, make some edits, and then output the file again with the original encoding. However, I haven't had any luck getting my program to output ANSI at all.

My code for loading the text is a simple Scanner with a charsetName specified:

fileScanner = new Scanner(f, CHARACTER_SET);

My current code for writing the file is the following:

BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), CHARACTER_SET));
writer.write(this.toString());
System.out.println("Writing " + name + " (" + method + ") using " + CHARACTER_SET + " encoding");
writer.close();

CHARACTER_SET is a String that is either "UTF8" or "windows-1252" depending on which encoding I detected the file to be when loading it.
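The detection step itself is not shown in the question. As a minimal sketch of one common way to choose between the two encodings, assuming the raw bytes of the file are available: try a strict UTF-8 decode and fall back to windows-1252 if it fails. The class and method names (`EncodingGuess`, `guess`) are hypothetical and not taken from the original program.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    // Hypothetical helper: returns UTF-8 if the bytes decode strictly as UTF-8,
    // otherwise falls back to windows-1252.
    static Charset guess(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the single-byte encoding.
            return Charset.forName("windows-1252");
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café"; the escape avoids depending on the source file encoding.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] cp1252 = "caf\u00e9".getBytes(Charset.forName("windows-1252"));
        System.out.println(guess(utf8));
        System.out.println(guess(cp1252));
    }
}
```

This is only a heuristic: a pure ASCII file is valid in both encodings, and a Cp1252 file can in rare cases contain byte sequences that happen to be valid UTF-8.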

The file actually outputs just fine in either mode, with all the special accented characters I've encountered coming through uncorrupted. The problem is that if I work on a Cp1252 file, it is output as UTF-8 even though I initialized the BufferedWriter with a Cp1252 OutputStreamWriter. I know the writer was given the right name because the encoding is set via CHARACTER_SET, and printing CHARACTER_SET right afterwards shows that Cp1252 was used for ANSI files. I'm checking the encoding of the output by loading it in Notepad++ and reading the encoding shown in the bottom right.
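A file without a byte order mark carries no encoding label, so an editor has to infer one from the bytes. A more direct check is to look at the bytes the writer produces. The sketch below is an illustration, not the original program: it pushes a string through an OutputStreamWriter into memory and prints the result as hex. The names `EncodedBytes` and `hex` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class EncodedBytes {
    // Encodes text through an OutputStreamWriter and returns the raw bytes as hex.
    static String hex(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, Charset.forName(charsetName))) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : out.toByteArray()) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is one byte in windows-1252, two in UTF-8.
        System.out.println(hex("\u00e9", "windows-1252"));       // E9
        System.out.println(hex("\u00e9", "UTF8"));               // C3 A9
        // A string that already holds the two characters U+00C3 U+00A9
        // comes out of a windows-1252 writer as bytes that are also valid UTF-8.
        System.out.println(hex("\u00c3\u00a9", "windows-1252")); // C3 A9
    }
}
```

The last line shows one way a correctly configured Cp1252 writer can still produce a file that reads as UTF-8: the string handed to it was decoded with the wrong charset earlier. Whether that is what happened here cannot be confirmed from the question alone.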

I know it seems like I'm splitting hairs a little, but I really do want to leave the file with its original encoding.

theolaa
  • It behaves exactly the same way whether I use the java.io name or the java.nio name (for both encodings). – theolaa Jun 19 '23 at 01:39
  • What are you using to read the data? Can you add that code? – Reilas Jun 19 '23 at 01:41
  • What is `this` and please show that class's `toString()` method. – President James K. Polk Jun 19 '23 at 01:45
  • @Reilas I've updated the question with that info – theolaa Jun 19 '23 at 01:45
  • @PresidentJamesK.Polk The `this` is a class representing a custom data structure. Its toString() method serializes it and returns it as a String. – theolaa Jun 19 '23 at 01:47
  • @theolaa, have a look at this [question and answer](https://stackoverflow.com/questions/36910513/js-java-writing-file-using-outputstreamwriter-and-utf-8-parameter-result-in-an). It looks like _Notepad++_ might be relaying the incorrect encoding. – Reilas Jun 19 '23 at 01:49
  • Yes, I understand that, but perhaps the bug is in your `toString()` method. – President James K. Polk Jun 19 '23 at 01:50
  • @Reilas Intriguing. The original file is reported by Notepad++ as ANSI, but the file being output is reported as UTF-8. I wonder what could cause the discrepancy if it is simply a matter of Notepad++ guessing incorrectly. – theolaa Jun 19 '23 at 01:53
  • @PresidentJamesK.Polk toString() simply calls the following method on the root node of the data structure. Here is a formatted screenshot: https://imgur.com/a/NZH2GTH Here is a pastebin of the text: https://pastebin.com/kHMXM827 – theolaa Jun 19 '23 at 01:59
  • @Reilas Unfortunately it seems that the output is indeed UTF-8, because if I change Notepad++'s encoding setting to ANSI (not converting the file, just viewing it while assuming a different encoding), it turns the special characters into gibberish. Same for viewing it as Cp1252 directly rather than ANSI. – theolaa Jun 19 '23 at 02:10

1 Answer


Well, I'm not 100% sure how this works, but I changed my write statement to the following:

writer.write(new String(this.toString().getBytes(Charset.forName(CHARACTER_SET))));

and now it works.

I think what's happening is that the file contents were being loaded correctly, but then re-encoded by Java's internal String format. In order to have it write the file in the format I wanted, I had to convert the text from Java's format into Cp1252 before writing it, even though I initially loaded it as Cp1252.

In conclusion, it seems that the issue was not with loading the text, or setting up the BufferedWriter, but rather it was with the text I was telling the BufferedWriter to write.
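One detail worth knowing about the expression above: `new String(byte[])` with no charset argument decodes using the platform default charset (`Charset.defaultCharset()`, which is UTF-8 by default since JDK 18), so its effect depends on the JVM it runs on. The sketch below makes the decoding charset explicit to show when such a round trip changes the text. The class and method names (`RoundTrip`, `reencode`) are hypothetical, and the second case is only one possible reading of why the fix helped, not something the question confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Same shape as the expression in the answer, but with the decoding
    // charset passed explicitly instead of taken from the platform default.
    static String reencode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String correct = "\u00e9";        // e with acute accent
        String misread = "\u00c3\u00a9";  // UTF-8 bytes of the same letter, decoded as windows-1252

        // Encoding and decoding with the same charset leaves the text unchanged.
        System.out.println(reencode(correct, CP1252, CP1252).equals(correct));

        // Encoding as windows-1252 and decoding as UTF-8 repairs text that was
        // originally UTF-8 but had been read as windows-1252.
        System.out.println(reencode(misread, CP1252, StandardCharsets.UTF_8).equals(correct));
    }
}
```

Both lines print `true`. If the second case matches what was happening, the text reaching the writer had been decoded with the wrong charset at some earlier point, and naming the charsets explicitly on both sides would make the behavior independent of the platform default.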

theolaa