Interpret a string from one encoding to another in Java

Question

I've looked around for answers to this (I'm sure they're out there), and I'm not sure it's possible.

So, I have a HUGE file that contains the word "för". I'm using RandomAccessFile because I know (roughly) where it is and can therefore use the seek() function to get there.

To know that I've found it, I have a String "för" in my program that I check for equality. Here's the problem: I ran the debugger, and when I get to "för", what I get to compare is "fÃ¶r".

So my program terminates without finding any "för".

This is the code I use to get a word:

    private static String getWord(RandomAccessFile file) throws IOException {
        StringBuilder stb = new StringBuilder();
        String word;
        char c;
        c = (char)file.read();
        int end;
        do {
            stb.append(c);
            end = file.read();
            if(end==-1)
                return "-1";
            c = (char)end;

        } while (c != ' ');
        word = stb.toString();
        word.trim();
        return word;
    }

So basically I return all the characters from the current point in the file up to the first ' ' character, which gives me the word. But since (char)file.read() reads a single byte (I think), the UTF-8 'ö' becomes the two characters 'Ã' and '¶'?

One reason for this guess is that if I open my file with encoding UTF-8 it's "för", but if I open the file with ISO-8859-15, the same place shows exactly what my getWord method returns: "fÃ¶r".

So my question:

When I'm sitting with a "för" and a "fÃ¶r", is there any way to fix this? Something like saying: read "fÃ¶r" as if it were a UTF-8 string, to get "för"?

Your problem is right here: `(char)file.read()`. The [`read()`](https://docs.oracle.com/javase/8/docs/api/java/io/RandomAccessFile.html#read--) method does *not* return a `char`. It returns a single byte (as an `int` from 0 to 255, or -1 at end of file). Do not cast a byte to a `char`. Also, why are you using a `RandomAccessFile` and not a more helpful `FileReader`, which automatically converts bytes to characters? – Andreas, Sep 01 '16 at 05:17
@Andreas RandomAccessFile has the function seek(long pos), which allows me to jump X bytes into the file without reading what comes before. – MrJalapeno, Sep 01 '16 at 05:25
But seeking might land you in the middle of a UTF-8 sequence, so how do you determine where to seek to? In UTF-8, characters take up a variable number of bytes, so you cannot know how many bytes to skip unless you read them. – Andreas, Sep 01 '16 at 05:28
I think OP has been warned sufficiently. We can't help everyone who wants to do encoding or timezone calculations on their own. People have to learn the lesson the hard way, I guess. – Ingo Bürk, Sep 01 '16 at 05:41
@Andreas Thank you so much for your help. I'm afraid it's a big file and I have to search it in a very short time span, meaning I have to use seek(long pos). You are absolutely right that this might land me in the middle of a UTF-8 sequence, so what I'm looking at right now is to perhaps write the file (which is something my program does before this happens) in ISO-8859-1, so that each character is a byte and I can then use the seek method efficiently. – MrJalapeno, Sep 01 '16 at 05:51

score 3 · Answer 1 · answered Sep 01 '16 at 06:31

If you have to use a RandomAccessFile, you should read the content into a byte[] first and then convert the complete array to a String, something along the lines of:

    byte[] buffer = new byte[whatever];
    file.read(buffer);
    String result = new String(buffer, "UTF-8");

This is only to give you a general impression of what to do; you'll have to add some length handling etc.
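As a rough sketch of that length handling, one option is to collect raw bytes up to the next space and decode them in a single step. The names `Utf8WordReader` and `readWord` are made up for illustration, and the code assumes the file is valid UTF-8 and that the file pointer starts on a character boundary. Splitting on the byte 0x20 is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so a space byte can never appear inside one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Utf8WordReader {

    /**
     * Reads bytes from the current file position up to (not including) the
     * next space or end of file, then decodes all collected bytes as UTF-8.
     * Returns null if the file pointer is already at end of file.
     */
    static String readWord(RandomAccessFile file) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int b;
        // Compare raw byte values; never cast a single byte to char.
        while ((b = file.read()) != -1 && b != ' ') {
            bytes.write(b);
        }
        if (b == -1 && bytes.size() == 0) {
            return null; // nothing left to read
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Returning null at end of file avoids the ambiguity of the "-1" sentinel string, which could also be a legitimate word in the file.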

This will not work correctly if you start reading in the middle of a UTF-8 sequence, but neither will any other method.
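If the seek position is only approximate, one way to avoid starting mid-sequence is to skip forward past continuation bytes, which in UTF-8 always have the bit pattern 10xxxxxx. This is a minimal sketch, assuming valid UTF-8 input; `Utf8Resync` and `seekToCharStart` are hypothetical names, not part of any library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of any library.
class Utf8Resync {

    /** True for UTF-8 continuation bytes, which look like 10xxxxxx. */
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Seeks to pos, then moves forward past any continuation bytes so that
     * the file pointer rests on the first byte of a character (or at end of
     * file). Returns the resulting file position.
     */
    static long seekToCharStart(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos);
        int b;
        while ((b = file.read()) != -1 && isContinuationByte(b)) {
            // skip: this byte belongs to a character that started earlier
        }
        if (b != -1) {
            // We consumed the first byte of the next character; step back onto it.
            file.seek(file.getFilePointer() - 1);
        }
        return file.getFilePointer();
    }
}
```

In valid UTF-8 a character has at most three continuation bytes, so the loop skips only a few bytes before finding a boundary.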

score 1 · Answer 2 · edited May 23 '17 at 10:29

You are using RandomAccessFile.read(). This reads single bytes. UTF-8 sometimes uses several bytes for one character.

Different methods to read UTF-8 from a RandomAccessFile are discussed here: Java: reading strings from a random access file with buffered input

If you don't necessarily need a RandomAccessFile, you should definitely switch to reading characters instead of bytes.

If possible, I would suggest Scanner.next(), which returns the next whitespace-delimited word by default.
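A minimal sketch of that character-based approach, assuming the file is UTF-8 and that reading it from the start is acceptable. `WordSearch` and `containsWord` are illustrative names only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

// Hypothetical helper, not part of any library.
class WordSearch {

    /** Returns true if the UTF-8 file contains target as a whitespace-delimited word. */
    static boolean containsWord(Path path, String target) throws IOException {
        // The reader decodes bytes to chars, so multi-byte characters stay intact.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
             Scanner scanner = new Scanner(reader)) {
            while (scanner.hasNext()) {
                if (scanner.next().equals(target)) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

This reads sequentially from the beginning of the file, so it trades the seek() shortcut for correct decoding.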

answered Sep 01 '16 at 05:35

slartidan

@Andreas you should stay away from `Scanner`, _if performance matters_. – slartidan Sep 01 '16 at 10:34

score -1 · Accepted Answer · answered Sep 01 '16 at 04:43

    import java.nio.charset.Charset;
    String encodedString = new String(originalString.getBytes("ISO-8859-15"), Charset.forName("UTF-8"));
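The sketch below illustrates this kind of round trip, assuming the garbled string was produced by casting each byte to a `char` as in the question's getWord method. That cast maps byte values 0 to 255 directly to U+0000 through U+00FF, which matches ISO-8859-1; ISO-8859-15 differs from it in eight positions (0xA4 among them), so it happens to work for "för" but not for every word. `RoundTripDemo`, `garble` and `repair` are names invented for this example, and the strings use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class RoundTripDemo {

    /** Mimics the question's (char) cast: each UTF-8 byte becomes one char. */
    static String garble(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    /** Re-encodes the garbled chars with the given charset, then decodes as UTF-8. */
    static String repair(String garbled, Charset charset) {
        return new String(garbled.getBytes(charset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset latin9 = Charset.forName("ISO-8859-15"); // assumed to be available in the JDK
        String foer = "f\u00f6r"; // "för": ö is the UTF-8 bytes C3 B6
        String haer = "h\u00e4r"; // "här": ä is the UTF-8 bytes C3 A4

        System.out.println(repair(garble(foer), latin1).equals(foer)); // true
        System.out.println(repair(garble(foer), latin9).equals(foer)); // true
        System.out.println(repair(garble(haer), latin1).equals(haer)); // true
        // U+00A4 has no encoding in ISO-8859-15, so getBytes substitutes '?'.
        System.out.println(repair(garble(haer), latin9).equals(haer)); // false
    }
}
```

Even with ISO-8859-1, this only undoes a byte-to-char cast that has already happened; decoding the bytes as UTF-8 when reading avoids the problem in the first place.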

answered Sep 01 '16 at 04:43

Sergey Gornostaev

After some googling (just a few seconds ago) I managed to implement what looks like a solution. It's basically `byte[] utf8Bytes = theWord.getBytes("ISO-8859-1");` and then `theWord = new String(utf8Bytes, "UTF8");`. theWord has now gone from "fÃ¶r" to "för". Are there any reasons for doing it one way or the other? Just curious :) PS: I just implemented your solution and it solves the problem as well, so I'll accept your answer. – MrJalapeno Sep 01 '16 at 04:56
Your solution and mine are the same. The only difference is that mine is a single line. – Sergey Gornostaev Sep 01 '16 at 05:03
ISO-8859-15 doesn't reverse the very bad `byte` to `char` cast done in the code. – Andreas Sep 01 '16 at 05:22
While it might work in this case, converting the encoding after already converting something to a `String` is bound to get you in trouble, because information might already be lost in the first conversion from bytes to `String`. The only right place to handle encoding issues is while reading/writing. – piet.t Sep 01 '16 at 06:21
@piet.t I agree, but the author was asking how to convert a string from one encoding to another. – Sergey Gornostaev Sep 01 '16 at 06:34
...to which the correct answer is "this can't be done correctly". – piet.t Sep 01 '16 at 06:38
@piet the answer is really that a `String` in Java always has the same encoding, so the question doesn't make sense, or perhaps "this can't be done _at all_". – davmac Sep 01 '16 at 10:42
