file reading encoding trouble

Question

I have to read a file, do something with its contents, and then write the results to another file. The problem is that the original file contains characters from Asian languages, like 坂本龍一, 東京事変 and メリー (I guess they're Chinese, Japanese and Korean). I can see them correctly in Notepad++.

The problem is that when I read them and write them back out via Java they get corrupted, and I see weird stuff in my output file like ???????? or Ð–Ð°Ð½Ð½Ð° Ð‘Ð¸Ñ‡ÐµÐ²Ñ?ÐºÐ°Ñ?. I think I got something wrong with the encoding, but I have no idea which one to use or how to use it.

Can someone help me? Here's my code:

    String fileToRead = SONG_2M;
    Scanner scanner = new Scanner(new File(fileToRead), "UTF-8");

    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        String[] songData = line.split("\t");
        if (/*something*/) {
            // save the string in the map
        }
    }
    scanner.close();

    saveFile("coded_artist_small2.txt");
    }

    public void saveFile(String fileToSave) throws FileNotFoundException, UnsupportedEncodingException {
        PrintWriter writer = new PrintWriter(fileToSave, "UTF-8");

        for (Entry<String, Integer> entry : artistsMap.entrySet()) {
            writer.println(entry.getKey() + DELIMITER + entry.getValue());
        }

        writer.close();
    }

Encoding and decoding must use the same charset. Where did you get that file, anyway? – Sarz Dec 15 '14 at 10:42
Additionally, your code is hard to read when it's formatted like this, it's incomplete, and it's clearly doing more than it needs to just to demonstrate the problem. Please provide a *short but complete* program (properly formatted) which demonstrates the problem. – Jon Skeet Dec 15 '14 at 10:46
I don't know what encoding the file uses; Notepad++ says it's in UTF-8 (or at least it reads it using this charset). The file was provided by my professor. It's a university project (which has nothing to do with encoding :P ). Everything works fine except for this problem, and I want to solve it before submitting the project. I just edited the question to make it more readable. Sorry for the badly formatted version. – jack_the_beast Dec 15 '14 at 10:56

score 0 · Answer 1 · edited May 23 '17 at 10:33

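It is possible that your input file is not, in fact, encoded in UTF-8 (a variable-width Unicode encoding that uses one to four bytes per character). For instance, the character 坂 you are seeing is Unicode code point U+5742. In UTF-8 it is stored as the three bytes 0xE5 0x9D 0x82, whereas in UTF-16 it is stored as the two bytes 0x57 0x42. If the file is read with a charset that doesn't match the one it was written in, for example UTF-16 data read as ASCII, those two bytes are displayed as the separate characters W and B rather than as 坂.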
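To make the byte values concrete, here is a minimal sketch that prints how the single character U+5742 is stored under two Unicode encodings. The class name `EncodingBytes` and the helper `toHex` are made up for this illustration and are not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows the bytes behind one character in two encodings.
class EncodingBytes {

    /** Formats bytes as space-separated uppercase hex, e.g. "E5 9D 82". */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 坂, written as an escape so the encoding of this source file does not matter.
        String s = "\u5742";
        System.out.println("UTF-8:    " + toHex(s.getBytes(StandardCharsets.UTF_8)));    // E5 9D 82
        System.out.println("UTF-16BE: " + toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 57 42
    }
}
```

Opening the input file in a hex viewer and comparing the first bytes of a known name against output like this is one way to check which encoding the file really uses.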

If you're unsure of your file's encoding, take a guess that it might be plain ASCII text or text in your platform's default charset. Try removing the encoding when you set up the Scanner (which then uses the platform's default charset), i.e. make the second line of your code

    Scanner scanner = new Scanner(new File(fileToRead));

If, in fact, you know the file is Unicode, there are several possible encodings (UTF-8, UTF-16 and others). See this answer for a more comprehensive Unicode reader that deals with the various Unicode encodings.

For your output, you need to decide how you want the file encoded: some Unicode encoding (e.g. UTF-8) or ASCII. Note that ASCII cannot represent characters such as 坂, so only a Unicode encoding will preserve them.
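If you settle on UTF-8 for both sides, a sketch along the following lines keeps the charset explicit for reading and writing. It is only an illustration under assumptions: the class `Utf8RoundTrip`, its method names, counting the first tab-separated field, and using a tab as the output delimiter are all invented here (the question's `DELIMITER` and filtering condition are not shown).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example: read UTF-8, count keys, write UTF-8.
class Utf8RoundTrip {

    /** Counts how often each first tab-separated field occurs, decoding the input as UTF-8. */
    static Map<String, Integer> countFirstField(Path input) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.split("\t")[0];
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Writes "key<TAB>count" lines, encoding the output as UTF-8 (tab is an assumed delimiter). */
    static void save(Map<String, Integer> counts, Path output) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                writer.write(entry.getKey() + "\t" + entry.getValue());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf8RoundTrip.java <input file> <output file>
        Map<String, Integer> counts = countFirstField(Paths.get(args[0]));
        save(counts, Paths.get(args[1]));
    }
}
```

One reason to prefer `Files.newBufferedReader` here is that it throws a `MalformedInputException` when the bytes are not valid for the given charset, so a file that is not really UTF-8 fails loudly instead of producing garbled strings.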

answered Dec 15 '14 at 10:56

J Richard Snape


I already tried to read/write it without specifying an encoding, but I still get weird results – jack_the_beast Dec 15 '14 at 10:58
OK: following your question edit, I realise that is probably not the problem you're having. You need a little more detective work to track down where the bad encoding is being introduced. Put some debug output (e.g. System.out.println statements) in your loop to print the variable **line** before you split it, and the **String** key you use to add the entry to the **Map**. If they look OK on the console, then it's the output that is corrupting things; if not, the input stage is what we need to look at. – J Richard Snape Dec 15 '14 at 11:14
I checked both the debugger and the console: with UTF-8 reading mode, the variable in the debugger has the correct value (`マキシマム・ザ・ホルモン`) but the console shows question marks. Without specifying an encoding, both the debugger and the console show the same value, but it's incorrect (`ä¸‹æ?‘é™½å?`). Those values are maintained through all the data structures I used – jack_the_beast Dec 15 '14 at 21:31
Basically, what I'm saying is that it looks like you are viewing your output (which is UTF-8 encoded) in an ASCII viewer. This will give either ?????s, or something like `ãƒžã‚ã‚·ãƒžãƒ ãƒ»ã‚¶ãƒ»ãƒ›ãƒ«ãƒ¢ãƒ³ 5 English artist 8 å‚æœ¬é¾ä¸€ 15` for the example I gave. – J Richard Snape Dec 16 '14 at 09:58
OK, that could be. However, I'm running Eclipse on Windows. I can't check right now whether there's a setting to change the console encoding. Anyway, I don't care about the console; the problem is that the output file is incorrect. – jack_the_beast Dec 16 '14 at 10:18
Sure, I get it. My point was that the output file in my case came out fine with exactly your code, so your code is fine. I think you are viewing your output file using an ASCII viewer. I use Eclipse under Windows too. There is a setting for encoding in both the Eclipse viewer and Notepad++. In Eclipse on Windows, you can right-click the text file, click Properties in the menu and alter the text file encoding at the bottom of that screen. Your output looks like it's being viewed as ISO-8859-1 where it should be UTF-8. In Notepad++ it's the Encoding menu, which, from your question, you're familiar with. – J Richard Snape Dec 16 '14 at 10:33
After checking all the code for the third time and almost an hour of debugging, I can say that most of the artists' names are now written to the output file correctly, except for about twenty of them that are still incorrect. Honestly, I don't want to debug 175k+ artists, so I think I will accept these results and hope my professor doesn't get upset about it. Thank you for your help – jack_the_beast Dec 16 '14 at 23:28
No problem, sorry you couldn't solve it completely. If you do keep working on it, I'm happy to have a look at the ~20 problematic lines if you post the exact format of those lines of your input file. If not, I hope your prof is impressed :) – J Richard Snape Dec 18 '14 at 10:29
That could be a problem, because the original file is very large (more than 2.5 GB!!), so it's hard to search through it. I guess I'll leave it as it is :) – jack_the_beast Dec 18 '14 at 10:36
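For a file too large to search by hand, one possible approach is to stream it once and report the line numbers where UTF-8 decoding failed. The sketch below is an assumption-laden illustration, not something from the thread: the class `SuspectLineFinder` and its method are hypothetical, and it relies on `InputStreamReader` substituting U+FFFD for malformed bytes when it is constructed with a `Charset`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds lines that did not decode cleanly as UTF-8.
class SuspectLineFinder {

    /** Returns the 1-based numbers of lines containing U+FFFD, at most `limit` of them. */
    static List<Long> findSuspectLines(Path file, int limit) throws IOException {
        List<Long> suspects = new ArrayList<>();
        // An InputStreamReader built from a Charset replaces malformed input
        // with U+FFFD instead of throwing, so the whole file can be scanned.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            long lineNumber = 0;
            while (suspects.size() < limit && (line = reader.readLine()) != null) {
                lineNumber++;
                if (line.indexOf('\uFFFD') >= 0) {
                    suspects.add(lineNumber);
                }
            }
        }
        return suspects;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SuspectLineFinder.java <input file>
        for (long n : findSuspectLines(Paths.get(args[0]), 100)) {
            System.out.println("Suspect line: " + n);
        }
    }
}
```

Because the file is read line by line, memory use stays small even for multi-gigabyte input. Lines that already contain a literal U+FFFD are flagged too, which is usually also worth a look.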
