15

I have a text file that was encoded with UTF-8 (it contains language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.

I want to read it line by line.

String str = myreader.readLine(); // returns wrong text, not decoded
String str = myreader.readUTF();  // throws java.io.EOFException
tchrist
kenny

8 Answers

19

You can convert the string read by readLine to UTF-8 using the following code:

public static void main(String[] args) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("MyFile.txt"), "r");
    String line = raf.readLine();
    String utf8 = new String(line.getBytes("ISO-8859-1"), "UTF-8");
    System.out.println("Line: " + line);
    System.out.println("UTF8: " + utf8);
}

Content of MyFile.txt: (UTF-8 Encoding)

Привет из Украины

Console output:

Line: ÐÑÐ¸Ð²ÐµÑ Ð¸Ð· УкÑаинÑ
UTF8: Привет из Украины
Matthieu
picoworm
  • Thank you for posting your solution. Could you explain why `String UTF8 = new String(Line.getBytes("UTF-8"), "UTF-8");` isn't working? – thomasb Feb 09 '16 at 14:15
  • @thomasb `getBytes("UTF-8")` will transform the internal byte array. `ISO-8859-1` is a "raw" encoding. – Matthieu Jan 23 '17 at 15:23
  • See [this question](https://stackoverflow.com/q/15925458) concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in [RFC1345](https://tools.ietf.org/html/rfc1345) and see that control codes C0 and C1 are mapped onto the 65 unused bytes. – Ludovic Kuty Jan 30 '19 at 04:41
  • 1
    See my answer for detailed information on the correctness of the expression `new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8")`. – Ludovic Kuty Jan 30 '19 at 05:14
4

The API docs say the following for readUTF:

Reads in a string from this file. The string has been encoded using a modified UTF-8 format.

The first two bytes are read, starting from the current file pointer, as if by readUnsignedShort. This value gives the number of following bytes that are in the encoded string, not the length of the resulting string. The following bytes are then interpreted as bytes encoding characters in the modified UTF-8 format and are converted into characters.

This method blocks until all the bytes are read, the end of the stream is detected, or an exception is thrown.

Is your string formatted in this way?

This appears to explain your EOFException.

Your file is a text file so your actual problem is the decoding.

The simplest answer I know is:

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("jedis.txt"), "UTF-8"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.equals("Obi-wan")) {
            System.out.println("Yay, I found " + line + "!");
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

Or you can set the current system encoding with the system property file.encoding to UTF-8.

java -Dfile.encoding=UTF-8 com.jediacademy.Runner arg1 arg2 ...

You may also set it as a system property at runtime with System.setProperty(...) if you only need it for this specific file, but in a case like this I think I would prefer the InputStreamReader.

By setting the system property you can use FileReader and expect that it will use UTF-8 as the default encoding for your files. In this case for all the files that you read and write.

If you intend to detect decoding errors in your file, you would be forced to use the InputStreamReader approach and use the constructor that receives a decoder.

Something like:

CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"), decoder));

You may choose between the actions IGNORE, REPLACE, and REPORT.
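A hypothetical helper (the class and the name isValidUtf8 are mine, not from any library) showing what REPORT does in practice: a decoder configured this way throws a CharacterCodingException on bytes that are not valid UTF-8, instead of silently substituting characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true if data is well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data)); // throws on malformed input
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("Привет".getBytes(StandardCharsets.UTF_8))); // true
        // 0xC3 is a lead byte that expects a continuation byte; 0x28 is not one
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0x28})); // false
    }
}
```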

EDIT

If you insist on using RandomAccessFile, you would need to know the exact offset of the line that you intend to read. And not only that: in order to read with the readUTF() method, you must have written the file with the writeUTF() method, because that method, as the Javadocs quoted above state, expects a specific format in which the first 2 bytes represent the length in bytes of the modified-UTF-8 string.

As such, if you do:

try(RandomAccessFile raf = new RandomAccessFile("jedis.bin", "rw")){

    raf.writeUTF("Luke\n"); //2 bytes for length + 5 bytes
    raf.writeUTF("Obiwan\n"); //2 bytes for length + 7 bytes
    raf.writeUTF("Yoda\n"); //2 bytes for length + 5 bytes

}catch(IOException e){
    e.printStackTrace();
}

You should not have any problems reading back from this file using the method readUTF(), as long as you can determine the offset of the given line that you want to read back.

If you opened the file jedis.bin, you would notice it is a binary file, not a text file.

Now, I know that "Luke\n" is 5 bytes in UTF-8 and "Obiwan\n" is 7 bytes in UTF-8. And that the writeUTF() method will insert 2 bytes in front of every one of these strings. Therefore, before "Yoda\n" there are (5+2) + (7+2) = 16 bytes.

So, I could do something like this to reach the last line:

try (RandomAccessFile raf = new RandomAccessFile("jedis.bin", "r")) {

    raf.seek(16);
    String val = raf.readUTF();
    System.out.println(val); //prints Yoda

} catch (IOException e) {
    e.printStackTrace();
}

But this will not work if you wrote the file with a Writer class, because writers do not follow the formatting rules of the writeUTF() method.

In a case like this, the best approach would be to format your binary file so that all strings occupy the same amount of space (number of bytes, not number of characters, because the number of bytes per character varies in UTF-8 depending on the characters in your String). If a string does not need all the space, you pad it.

That way you could easily calculate the offset of a given line, because all lines would occupy the same amount of space.
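A minimal sketch of that fixed-size-record layout (the class name, record size, and zero-padding scheme are my assumptions, not from the answer): each record is padded to RECORD_SIZE bytes, so record n always starts at byte n * RECORD_SIZE.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedRecordFile {
    static final int RECORD_SIZE = 32; // bytes per record; must fit the longest string

    static void writeRecord(RandomAccessFile raf, int index, String text) throws IOException {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > RECORD_SIZE) throw new IllegalArgumentException("record too long");
        byte[] record = Arrays.copyOf(utf8, RECORD_SIZE); // zero-padded to fixed size
        raf.seek((long) index * RECORD_SIZE);
        raf.write(record);
    }

    static String readRecord(RandomAccessFile raf, int index) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        raf.seek((long) index * RECORD_SIZE);
        raf.readFully(record);
        int len = record.length;
        while (len > 0 && record[len - 1] == 0) len--; // strip the zero padding
        return new String(record, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("records.bin", "rw")) {
            writeRecord(raf, 0, "Luke");
            writeRecord(raf, 1, "Привет"); // 12 UTF-8 bytes, still one 32-byte record
            System.out.println(readRecord(raf, 1)); // prints Привет
        }
    }
}
```

One caveat of this layout: zero bytes are used as padding, so the stored strings themselves must not end in NUL characters.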

Edwin Dalorzo
  • I created this text file using BufferedWriter(new OutputStreamWriter(new FileOutputStream(..),encoding) where encoding is utf8 – kenny Apr 01 '12 at 14:26
  • 1
    Then you cannot use RandomAccessFile to read it back. You have to use a reader class like BufferedReader or FileReader, and read from the beginning until you reach the line in question – Edwin Dalorzo Apr 01 '12 at 14:42
  • 1
    this is not efficient, I use seek to perform paging. If I use readers it will require me to reread the whole file every time. – kenny Apr 01 '12 at 15:44
  • @kenny if you know the exact offset location of your string and if it is formatted as the RandomAccessFile readUTF() method expects, then, as I explained first, you should not have any problems. Otherwise, you can read the bytes and apply the encoding using one of the String constructors. As such, your file cannot only be a text file, you would have to treat it as a binary file. Let me extend my answer. – Edwin Dalorzo Apr 01 '12 at 15:46
  • This is also a problem, i can't change how I write the file. So it will remain BufferedWriter. What do you mean I can read the bytes and apply encoding? – kenny Apr 01 '12 at 16:06
  • @kenny The `readLine` method does not support Unicode. And do you know how to calculate the offset of a given line in your file in order to seek to the specific position you want to read with `readLine`? – Edwin Dalorzo Apr 01 '12 at 16:22
4

You aren’t going to be able to go at it this way. The seek function will position you by some number of bytes. There is no guarantee that you are aligned to a UTF-8 character boundary.

tchrist
  • and if i use suggested argument java -Dfile.encoding=UTF-8 ? – kenny Apr 01 '12 at 15:16
  • 2
    @kenny UTF-8 encoding encodes characters with a variable number of bytes, therefore skipping to a byte offset within the file is probably going to fail (because as @tchrist mentioned) you may not be at the beginning of a character boundary when you get there. If you know the character offset you need, you can use `Reader.skip(long n)` to skip the number of characters. That should be encoding aware. Just be sure to set your character set on the `InputStreamReader`. – Brandon DuRette Apr 01 '12 at 15:44
  • 2
    Finding the next character in UTF-8 is easy. Just skip all the bytes in [0x80-0xBF], the first one not in that range will be the start of a character. (This is the self-synchronizing property, which Ken Thompson added to UTF-8). – ninjalj Apr 02 '12 at 18:52
  • @ninjalj This is helpful, but still doesn't allow to find the n-th character without looking at all the characters before. – Jens Schauder Jul 05 '13 at 10:56
  • You might want to create an index (some data structure which maps line number to byte offset) to position yourself on line n. That requires to read the file entirely one time. – Ludovic Kuty Jan 30 '19 at 04:46
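The self-synchronizing property mentioned in the comments can be sketched like this (the class and method names are mine): after an arbitrary seek, skip any bytes in the 0x80-0xBF range until you reach a byte that starts a character.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class Utf8Resync {
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Advance the file pointer to the start of the next UTF-8 character.
    static void alignToCharBoundary(RandomAccessFile raf) throws IOException {
        int b;
        while ((b = raf.read()) != -1 && isContinuation(b)) {
            // keep skipping continuation bytes
        }
        if (b != -1) {
            raf.seek(raf.getFilePointer() - 1); // step back onto the lead byte
        }
    }
}
```

As @ninjalj's comment notes, this finds the next character boundary, but it still cannot tell you which line or which character index you landed on.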
2

Once you are positioned at the start of a given line (this means you have solved the first part of your problem; see the answer by @martinjs), you can read the whole line and make a String out of it using the statement given in the answer by @Matthieu. But to check that the statement in question is correct, we have to ask ourselves 4 questions. It is not self-evident.

Note that the problem of getting at the start of a line may require to analyze the text to build an index if you need to randomly and quickly access many lines.

The statement to read a line and turn it into a String is :

String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");
  1. What is a valid byte in UTF-8? That is, which values are allowed. We'll see this question is in fact useless once we answer question 2.
  2. readLine(): UTF-8 bytes → UTF-16 code units, OK? Yes. Because UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes when the most significant byte (MSB) is 0, which is what readLine() guarantees.
  3. getBytes("ISO-8859-1"): characters encoded in UTF-16 (a Java String, with 1 or 2 char code units per character) → ISO-8859-1 bytes, OK? Yes. The code points of the characters in the Java string are ≤ 255, and ISO-8859-1 is a "raw" encoding, which means it can encode every one of those characters as a single byte.
  4. new String(..., "UTF-8"): ISO-8859-1 bytes → UTF-8 text, OK? Yes. Since the original bytes come from UTF-8 encoded text and were extracted as-is, they still represent text encoded in UTF-8.

Concerning the raw nature of ISO-8859-1 in which every byte (value 0 to 255) is mapped onto a character, I copy/paste below the comment I made on the answer by @Matthieu.

See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in RFC1345 and see that control codes C0 and C1 are mapped onto the 65 unused bytes of ISO/IEC 8859-1.
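To see steps 2-4 in action without touching a file, one can simulate readLine()'s byte-to-char widening in memory; this simulation is my own sketch, not code from the answer.

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Mimics RandomAccessFile.readLine(): each byte becomes a char with a zero high byte.
    static String simulateReadLine(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) sb.append((char) (b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "Привет".getBytes(StandardCharsets.UTF_8);
        String mangled = simulateReadLine(utf8); // mojibake, like raf.readLine() produces
        // getBytes(ISO-8859-1) maps each char 0-255 back to the identical byte value
        byte[] recovered = mangled.getBytes(StandardCharsets.ISO_8859_1);
        String decoded = new String(recovered, StandardCharsets.UTF_8);
        System.out.println(decoded); // prints Привет
    }
}
```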

Ludovic Kuty
1

I realise that this is an old question, but it still seems to have some interest, and no accepted answer.

What you are describing is essentially a data structures problem. The discussion of UTF8 here is a red herring - you would face the same problem using a fixed length encoding such as ASCII, because you have variable length lines. What you need is some kind of index.

If you absolutely can't change the file itself (the "string file") - as seems to be the case - you could always construct an external index. The first time (and only the first time) the string file is accessed, you read it all the way through (sequentially), recording the byte position of the start of every line, and finishing by recording the end-of-file position (to make life simpler). This can be achieved by the following code:

List<Long> myList = new ArrayList<>();
String line;
myList.add(0L); // assuming the first string starts at the beginning of the file
while ((line = myRandomAccessFile.readLine()) != null) {
    myList.add(myRandomAccessFile.getFilePointer());
}

You then write these integers into a separate file ("index file"), which you will read back in every subsequent time you start your program and intend to access the string file. To access the nth string, pick the nth and n+1th index from the index file (call these A and B). You then seek to position A in the string file and read B-A bytes, which you then decode from UTF8. For instance, to get line i:

myRandomAccessFile.seek(myList.get(i));
byte[] bytes = new byte[(int) (myList.get(i + 1) - myList.get(i))];
myRandomAccessFile.readFully(bytes);
String result = new String(bytes, "UTF-8");
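Persisting the index to the separate index file could be sketched as follows with DataOutputStream/DataInputStream; the count-then-longs layout and the class name are my assumptions, not from the answer.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // Writes the offset list as: entry count, then one long per offset.
    static void saveIndex(List<Long> offsets, File indexFile) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            out.writeInt(offsets.size());
            for (long off : offsets) out.writeLong(off);
        }
    }

    // Reads the offsets back in the same layout.
    static List<Long> loadIndex(File indexFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)))) {
            int n = in.readInt();
            List<Long> offsets = new ArrayList<>(n);
            for (int i = 0; i < n; i++) offsets.add(in.readLong());
            return offsets;
        }
    }
}
```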

In many cases, however, it would be better to use a database such as SQLite, which creates and maintains the index for you. That way, you can add and modify extra "lines" without having to recreate the entire index. See https://www.sqlite.org/cvstrac/wiki?p=SqliteWrappers for Java implementations.

martinjs
  • Or use an in-memory data structure like a map (`Map` and e.g. its implementation `HashMap`). It depends on the use case. This is an interesting problem and it combines two sub-problems : indexing with a data structure and UTF-8 decoding into `String`. Note that the indexing is useful only if you need to do many random accesses to specific lines (which is implied by the use of `RandomAccessFile`). – Ludovic Kuty Jan 30 '19 at 05:25
1

Reading the file via readLine() worked for me:

RandomAccessFile raf = new RandomAccessFile( ... );
String line;
while ((line = raf.readLine()) != null) { 
    String utf = new String(line.getBytes("ISO-8859-1"), "UTF-8");
    ...
}

// my file content has been created with:
raf.write(myStringContent.getBytes());
soulsurfer
1

The readUTF() method of RandomAccessFile treats the first two bytes at the current pointer as the size in bytes of the string that follows, then reads that many bytes and returns them as a string.

In order for this method to work, the content must have been written using the writeUTF() method, which stores the content size in those first two bytes and then writes the content. Otherwise, you will usually get an EOFException.

See http://www.zoftino.com/java-random-access-files for details.
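A minimal round trip illustrating the pairing (the file name is hypothetical): data written with writeUTF() can be read back with readUTF() because both agree on the 2-byte length prefix.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class UtfRoundTrip {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("utf.bin", "rw")) {
            raf.writeUTF("Привет"); // 2 length bytes + 12 payload bytes
            raf.seek(0);            // rewind to the start of the record
            System.out.println(raf.readUTF()); // prints Привет
        }
    }
}
```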

Arnav Rao
0

I find the API for RandomAccessFile is challenging.

If your text is actually limited to the ASCII range (code points 0-127, which UTF-8 encodes as single bytes), then it is safe to use readLine(), but read those Javadocs carefully: that is one strange method. To quote:

This method successively reads bytes from the file, starting at the current file pointer, until it reaches a line terminator or the end of the file. Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

To read UTF-8 safely, I suggest you read (some or all of the) raw bytes with a combination of length() and read(byte[]). Then convert your UTF-8 bytes to a Java String with this constructor: new String(byte[], "UTF-8").

To write UTF-8 safely, first convert your Java String to the correct bytes with someText.getBytes("UTF-8"). Finally, write the bytes using write(byte[]).
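Putting the read and write halves together, a sketch (the file name is hypothetical): write the String's UTF-8 bytes with write(byte[]), then read the whole file back with length()/readFully() and decode.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RawUtf8 {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("raw.txt", "rw")) {
            // Write: convert the String to raw UTF-8 bytes first
            raf.write("Привет из Украины".getBytes(StandardCharsets.UTF_8));

            // Read: pull all the bytes back and decode them as UTF-8
            raf.seek(0);
            byte[] bytes = new byte[(int) raf.length()];
            raf.readFully(bytes);
            String text = new String(bytes, StandardCharsets.UTF_8);
            System.out.println(text); // prints Привет из Украины
        }
    }
}
```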

kevinarpe