I am using RandomAccessFile to read some information from a large file. RandomAccessFile has a seek method that moves the cursor to a specific position in the file, and I want to read the whole line that starts there. To read this line I use the readLine() method.

I read the whole file once beforehand and created an index that lets me jump to the beginning of any line with the seek method. This index works fine. I created it based on this answer: https://stackoverflow.com/a/42077860/763368

Since I have to access this file many times, performance is an important concern, so I am looking for other ways to jump to a specific line and read the whole line.

I read that FileChannel with MappedByteBuffer is a good option to quickly read files, but I didn't see any solution that does what I want.

P.S.: the lines have different lengths and I don't know these lengths.

Does anybody have a good solution?

Edit:

The file I want to read has the following format: key\tvalue

The index is a hashmap whose keys are all the keys of that file and whose values are the byte positions (Long) where the corresponding lines start.

Let's suppose I want to go to the line with the key "foo". Then I must seek to the position stored for that key, like this:

raf.seek(index.get("foo"))

If I then call raf.readLine(), it returns the whole line with the key "foo".

But I don't want to use RandomAccessFile for this work because it is too slow.

This is how I am doing it now in Scala:

val raf = new RandomAccessFile(file,"r")  
raf.seek(position.get(key))
println(raf.readLine)
raf.close
Marcelo Machado
  • Are you accessing different files? If not, why do you close the file access? If you keep the file access open you don't have to wait for the OS to give you read permission. – Tschallacka Feb 21 '17 at 19:21
  • @Tschallacka I only close it after all the reads are done; this is just an example. But my problem here is the way to read the file. – Marcelo Machado Feb 21 '17 at 19:28
  • Can you provide the code of your index reading and how you translate it to a seek position. Because you're already on a good path, your index seeking might benefit from some optimisation, but without the full code and sample data it's hard to help. – Tschallacka Feb 21 '17 at 19:43
  • @Tschallacka I edited my question, please take a look. – Marcelo Machado Feb 21 '17 at 20:02

1 Answer

If you already have to read through the file once to find the indices of the keys, the fastest solution by far would be to read the lines and keep them in memory. If that doesn't work for some reason (e.g. memory constraints), using memory-mapped buffers can indeed be a good alternative. This is an outline of the code:

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

FileChannel channel = new RandomAccessFile("/some/file", "r").getChannel();

// A single mapping cannot exceed Integer.MAX_VALUE bytes, so cap the page size.
long pageSize = Math.min(channel.size(), Integer.MAX_VALUE);
long position = 0;
ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, position, pageSize);

ByteBuffer slice;
int maxLineLength = 30;
byte[] lineBuffer = new byte[maxLineLength];

// Read the line at indices 20 - 25 (6 bytes)
buffer.position(20);
slice = buffer.slice();
slice.get(lineBuffer, 0, 6);
// Decode only the bytes that were just read, not the whole lineBuffer
System.out.println("Starting at 20:" + new String(lineBuffer, 0, 6, StandardCharsets.UTF_8));

// Read the line at indices 0 - 10 (11 bytes)
buffer.position(0);
slice = buffer.slice();
slice.get(lineBuffer, 0, 11);
System.out.println("Starting at 0:" + new String(lineBuffer, 0, 11, StandardCharsets.UTF_8));

This code can also be used for very large files. Call channel.map for the "page" that contains your key, starting at position = keyIndex / pageSize * pageSize (integer division, so this rounds down to a page boundary), and then call buffer.position with the offset inside that page: keyIndex - position
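The page arithmetic above can be sketched in Java as follows. This is only an illustration: the class PageMath, the helper names pageStart and offsetInPage, the 1 GiB page size, and the example offset are all assumptions made for the sketch, not part of the original outline.

```java
// Hypothetical helper class illustrating the page arithmetic; the names are not from any library.
final class PageMath {
    // Start of the page that contains keyIndex; integer division rounds down to a page boundary.
    static long pageStart(long keyIndex, long pageSize) {
        return keyIndex / pageSize * pageSize;
    }

    // Offset of keyIndex inside its page. It fits in an int as long as pageSize <= Integer.MAX_VALUE.
    static int offsetInPage(long keyIndex, long pageSize) {
        return (int) (keyIndex - pageStart(keyIndex, pageSize));
    }

    public static void main(String[] args) {
        long pageSize = 1L << 30;       // assumed page size of 1 GiB
        long keyIndex = 5_000_000_000L; // assumed byte offset taken from the index
        // pageStart is what you would pass to channel.map, offsetInPage to buffer.position
        System.out.println("map at " + pageStart(keyIndex, pageSize));
        System.out.println("position at " + offsetInPage(keyIndex, pageSize));
    }
}
```

Because the index values are of type long while buffer.position takes an int, the subtraction has to happen before the cast, as in offsetInPage.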

If you really don't have any way to group access to one "page" together, then you don't need the slice. Performance won't be as good, but this allows you to simplify the code further:

byte[] lineBuffer = new byte[maxLineLength];
// ...
// keyIndex is the start offset of the line; lineLength is its length in bytes (at most maxLineLength)
ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, keyIndex, lineLength);
buffer.get(lineBuffer, 0, lineLength);
System.out.println(new String(lineBuffer, 0, lineLength, StandardCharsets.UTF_8));

Note that the ByteBuffer is not created on the JVM heap, but is actually a memory-mapped file at the OS level. (As of Java 8, you can verify this by looking at the source code and searching for sun.nio.ch.DirectBuffer in the implementation.)
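A quicker check than reading the JDK sources is to ask the buffer itself through the public ByteBuffer.isDirect() method. The following is a minimal sketch that maps a small temporary file; the class name MappedBufferCheck, the method name, and the file contents are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; only the java.nio calls are real API.
final class MappedBufferCheck {
    // Maps a small temporary file and reports whether the resulting buffer is direct (off-heap).
    static boolean mappedBufferIsDirect() {
        try {
            Path tmp = Files.createTempFile("mapped", ".txt");
            tmp.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.write(tmp, "foo\tbar\n".getBytes(StandardCharsets.UTF_8));
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                return buffer.isDirect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("direct: " + mappedBufferIsDirect());
    }
}
```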

Line size: The best way to get the line size is to store it when you scan through the file, i.e. use Map[String, (Long, Int)] instead of what you are using for index now. If that doesn't work for you, you should run some tests to find out what is faster:

  • Just store the maximum line size, read that many bytes, and then search for the line break in the resulting string. In this case, make sure your unit tests cover reads near the end of the file, where fewer than the maximum number of bytes are left.
  • Scan ahead with ByteBuffer.get until you hit a \n. If your files use a multi-byte encoding such as UTF-16, this is probably not an option, since the ASCII code for the line break (0x0A) can appear as part of another character, for example in the UTF-16 encoded Korean syllable with the character code 0xAC0A.

This would be the Scala code for the second approach:

// this happens once
val maxLineLength: Int = 2000 // find this in your initial sequential scan
val lineBuffer = new Array[Byte](maxLineLength)

// this is how you read a key
val start: Long = index("key")
val bufferLength = math.min(maxLineLength.toLong, channel.size() - start)
val buffer = channel.map(FileChannel.MapMode.READ_ONLY, start, bufferLength)
var lineLength = 0 // or minLineLength
// stop at the line break, or at the end of the mapping if the last line has none
while (lineLength < buffer.limit() && buffer.get(lineLength) != '\n') {
  lineLength += 1
}
buffer.get(lineBuffer, 0, lineLength) // copies lineLength bytes; the line break is excluded
println(new String(lineBuffer, 0, lineLength, StandardCharsets.UTF_8))
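For comparison, the other suggestion (storing the line length next to the offset during the initial scan) might look like the following Java sketch. It assumes a UTF-8 file in the key\tvalue format from the question; the class LineIndexSketch, the Entry type, and the helpers buildIndex, readLine, and demo are hypothetical names introduced only for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

final class LineIndexSketch {
    // Hypothetical index entry: byte offset of the line start and line length in bytes, without '\n'.
    static final class Entry {
        final long offset;
        final int length;
        Entry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    // One sequential pass over the file that records offset and length for every key\tvalue line.
    static Map<String, Entry> buildIndex(Path file) throws IOException {
        Map<String, Entry> index = new HashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long lineStart = 0;
            long pos = 0;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    addEntry(index, line, lineStart);
                    line.reset();
                    lineStart = pos; // the next line starts right after the line break
                } else {
                    line.write(b);
                }
            }
            if (line.size() > 0) addEntry(index, line, lineStart); // last line without '\n'
        }
        return index;
    }

    // Lines without a tab are skipped, since they carry no key.
    private static void addEntry(Map<String, Entry> index, ByteArrayOutputStream line, long start) {
        String text = new String(line.toByteArray(), StandardCharsets.UTF_8);
        int tab = text.indexOf('\t');
        if (tab >= 0) index.put(text.substring(0, tab), new Entry(start, line.size()));
    }

    // Maps exactly the bytes of one line and decodes them; no scanning for '\n' is needed.
    static String readLine(FileChannel channel, Entry e) throws IOException {
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, e.offset, e.length);
        byte[] bytes = new byte[e.length];
        buffer.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Small end-to-end run on a temporary file with made-up contents.
    static String demo(String key) {
        try {
            Path tmp = Files.createTempFile("index-demo", ".tsv");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, "a\t1\nfoo\tbar\nlonger key\tsome value\n".getBytes(StandardCharsets.UTF_8));
            Map<String, Entry> index = buildIndex(tmp);
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
                return readLine(channel, index.get(key));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("foo"));
    }
}
```

This sketch maps one line per lookup, which corresponds to the simplified variant without slices; combining it with the paged mapping is left out to keep the example short.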
Tilo
  • I have an index, so I can access the beginning of a line. I look up this index and then seek to that position. With options other than RandomAccessFile I would like to seek to this position too; the index will still be used. – Marcelo Machado Feb 21 '17 at 19:25
  • I read the whole file beforehand and created an index. I keep this index in memory so I can look up a key and go to the beginning of its line with the seek method. With options other than RandomAccessFile I would like to seek to this position too; the index will still be used. – Marcelo Machado Feb 21 '17 at 19:31
  • No, I can't put that file in memory, it's more than 100GB. My solution works, but it's slow and that is my problem. – Marcelo Machado Feb 21 '17 at 20:20
  • I don't have the line size. I want to go to the beginning of the line (with a seek) and then get the whole line, which ends with \n. – Marcelo Machado Feb 21 '17 at 21:23
  • @Marcelo Machado I hope my edited answer makes it clearer how to use the approach for large files. – Tilo Feb 21 '17 at 21:32
  • What exactly do I have to put in pageSize? Is it preferable to use my file size, for example 6.7 GB? – Marcelo Machado Feb 21 '17 at 21:53
  • I'd like to scan ahead until I hit a \n, but for now I am still trying to seek to the position. Could you please post this code? – Marcelo Machado Feb 21 '17 at 21:58
  • Page size cannot be bigger than Integer.MAX_VALUE, since the parameters for `ByteBuffer.get` are integers. – Tilo Feb 21 '17 at 22:28
  • I tried, but I definitely couldn't seek with this: position = keyIndex / pageSize * pageSize and then call buffer.position from that index: keyIndex - position. By the way, my index values are of type Long. – Marcelo Machado Feb 21 '17 at 22:42
  • You would use `keyIndex / pageSize * pageSize` in the call to `channel.map` and `keyIndex - position` in the call to `buffer.position`. You still have to handle the case where an entry extends across two pages. One way to do that is to use half the page size for calculating `position`. But if you find this hard to follow, you might want to start with the other way, where you just put your `index("key")` call into `channel.map` and don't call `buffer.position` at all, like in my Scala example. – Tilo Feb 22 '17 at 17:29