5

I have a large binary file (12 GB) from which I want to assemble a smaller binary file (16 KB) on the fly. Assume the file is on disk, and that the bytes for the smaller file are somewhat randomly distributed in the large binary file. What's the best and fastest way to do this? So far I've not been able to do better than about three minutes.

Things I've tried, which have more or less the same performance:

  1. Converting the file to the HDF5 format and using the C interface (slow).
  2. Writing a small C program to fseek() through the file (slow); a rough sketch of what this looks like is shown below.
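
For reference, the fseek() program is roughly the following (the offsets and file names here are just placeholders for this sketch; in reality there are ~16000 offsets):

```c
/* Rough sketch of the fseek() approach: one seek and one 1-byte read
 * per wanted offset. */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    long long offsets[] = { 1234567LL, 98765432LL, 4567890123LL };
    size_t n = sizeof(offsets) / sizeof(offsets[0]);

    FILE *in  = fopen("big.bin", "rb");
    FILE *out = fopen("small.bin", "wb");
    if (!in || !out)
        return 1;

    for (size_t i = 0; i < n; i++) {
        unsigned char byte;
        fseeko(in, (off_t)offsets[i], SEEK_SET);   /* one seek per byte */
        if (fread(&byte, 1, 1, in) == 1)
            fwrite(&byte, 1, 1, out);
    }

    fclose(out);
    fclose(in);
    return 0;
}
```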

How can I randomly access this data really fast?

I want to get to less than a couple of seconds for the query.

Genausactly
  • I doubt it is physically possible to randomly read 12G and write them back in a couple seconds. – Jacob Jul 11 '11 at 14:19
  • Can you give more details? Do you need to scan all 12GB to locate the bytes for the smaller file, or is there an algorithm/header/chain/whatever that tells you where they are? Your slow 'fseek' program would help explain more... – Roddy Jul 11 '11 at 14:57
  • 3 minutes for a single 16kb file or to split the whole 12gb into 16kb chunks? – Vinicius Kamakura Jul 11 '11 at 15:01

7 Answers

15

The answer is basically "no".

A single mechanical disk drive is going to take 10 ms or so to perform a seek, because it has to move the disk head. 16000 seeks times 10 milliseconds per seek equals 160 seconds. It makes absolutely no difference how you write your code; e.g. mmap() is going to make no difference.

Welcome to the physical world, software person :-). You must improve the locality of your operations.

First, sort the locations you are accessing. Nearby locations in the file are likely to be nearby on disk, and seeking between nearby locations is faster than seeking randomly.

Next, your disk can probably read sequential data at around 100 megabytes/second; that is, it can read 1 megabyte sequentially in around the same time it takes to perform a seek. So if two of your values are less than 1 megabyte apart, you are better off reading all of the data between them than performing the seek between them. (But benchmark this to find the optimal trade-off on your hardware.)
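
For illustration, here is a sketch of those two ideas combined: sort the offsets, then merge reads whose gap is below a threshold. The 1 MB threshold, the offset array, and the single byte read per offset are assumptions to be tuned for your real data:

```c
/* Sort the offsets, then coalesce reads whose gap is smaller than
 * MERGE_THRESHOLD into one sequential read.  Results land in out[]
 * in sorted-offset order. */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define MERGE_THRESHOLD (1024 * 1024)   /* roughly "one seek" worth of data */

static int cmp_off(const void *a, const void *b)
{
    off_t x = *(const off_t *)a, y = *(const off_t *)b;
    return (x > y) - (x < y);
}

static int extract(FILE *f, off_t *offs, size_t n, unsigned char *out)
{
    qsort(offs, n, sizeof(off_t), cmp_off);

    size_t i = 0;
    while (i < n) {
        /* Extend the run while the next offset is still "close". */
        size_t j = i;
        while (j + 1 < n && offs[j + 1] - offs[i] < MERGE_THRESHOLD)
            j++;

        size_t span = (size_t)(offs[j] - offs[i]) + 1;
        unsigned char *buf = malloc(span);
        if (!buf)
            return -1;

        if (fseeko(f, offs[i], SEEK_SET) != 0 ||   /* one seek per run */
            fread(buf, 1, span, f) != span) {
            free(buf);
            return -1;
        }
        for (size_t k = i; k <= j; k++)
            out[k] = buf[offs[k] - offs[i]];

        free(buf);
        i = j + 1;
    }
    return 0;
}
```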

Finally, a RAID can help with throughput (but not seek time). It can also provide multiple disk heads that can seek concurrently if you want to multi-thread your read code.

But in general, accessing random data is about the worst thing you can ask your computer to do, whether in memory or on disk. And the relative difference between sequential access and random access increases every year because physics is local. (Well, the physics we depend on here, anyway.)

[edit]

@JeremyP's suggestion to use SSDs is a good one. If they are an option, they have an effective seek time of 0.1 ms or so, meaning you could expect your code to run 50-100 times faster on such hardware. (I did not think of this because I usually work with files in the 1 TB range, where SSDs would be too expensive.)

[edit 2]

As @FrankH mentions in a comment, some of my suggestions assume that the file is contiguous on disk, which of course is not guaranteed. You can help to improve this by using a good file system (e.g. XFS) and by giving "hints" at file creation time (e.g. use posix_fallocate to inform the kernel you intend to populate a large file).
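
For example, something along these lines when the big file is first created (the path and size below are placeholders for this sketch):

```c
/* Reserve the space up front so the filesystem can lay the file out
 * (mostly) contiguously. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int create_preallocated(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return -1; }

    int err = posix_fallocate(fd, 0, size);   /* returns an errno value */
    if (err != 0)
        fprintf(stderr, "posix_fallocate: error %d\n", err);

    close(fd);
    return err ? -1 : 0;
}

/* e.g. create_preallocated("big.bin", (off_t)12LL * 1024 * 1024 * 1024); */
```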

Nemo
  • good comment, particularly also the note about seek distance, and that a big read (with data throwaway) might beat two small reads (with seek in-between). Also worth mentioning that for files in filesystems, logically contiguous data within a file might not map to physically contiguous blocks on disk; hence either strategies to make that file contiguous to start with, or else at least blocking reads that take some filesystem-provided size hints into account might be a good idea. – FrankH. Jul 11 '11 at 16:03
  • Excellent response. It wasn't clear at the moment, but it's obvious now that my speeds are limited by seek time: 1024 * 768 * (10 milliseconds) = 2.18 hours... and I wanted to do this in real time! If I load the whole 12G file into memory, I can get the data I'm looking for in about 5 seconds... still a little slow. The problem was solved by breaking the file into manageable chunks and loading those chunks into memory on many different machines using MPI to coordinate the transfers. The latency using this operation comes out to be less than a second. – Genausactly Jul 14 '11 at 06:34
6

Well, the speed you can achieve for this largely depends on the total number of read operations you perform in order to extract the 16 kB which make up the payload for your new file.

Why is that so? Because random reads from (spinning) disks are seek-limited; the read as such is (almost) infinitely fast compared to the time it takes to re-position the magnetic heads.

Since you're saying the access pattern is random, you're also not likely to benefit from any readahead that the operating system may decide to use; you can, if you so choose, therefore switch that off via posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) on the file descriptor for the big file, or via madvise() if you've chosen to mmap() it. But that'll only gain you anything if you're performing big reads (and you know a big readahead would be nonsense). For small reads, it's exclusively the seek time that'll determine the total.
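
A minimal sketch of that, assuming the individual reads are then done with pread() on the same descriptor (the path is a placeholder):

```c
/* Open the big file and tell the kernel the access pattern is random,
 * so it skips readahead. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>

int open_for_random_reads(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    /* len == 0 means "from this offset to the end of the file". */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0)
        fprintf(stderr, "posix_fadvise failed (advice only, non-fatal)\n");

    return fd;
}
```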

Assuming you need N random reads and you've got an M msec seek time, it'll take at least N * M milliseconds to perform the data extraction (if you've got the disk to yourself ...). There is no way to break this barrier.

Edit: A few things on mitigating strategies:

As mentioned by several people, the key to approaching this problem is to minimize seeks. There are several strategies for this:

  1. Issue asynchronous reads if you can (i.e. if read operation N+1 doesn't depend on what read operation N did, then you can issue both concurrently). This allows the operating system / device driver to queue them up and possibly re-order them (or merge them with reads done by other concurrently running processes) for most efficient seeking.
  2. If you know the positions all in advance, then perform scatter-gather I/O (the UN*X preadv() would come to mind), to the same effect; see the sketch after this list.
  3. Query your filesystem and/or block device for the best / minimum blocksize; how to do this is system-dependent, see e.g. statvfs() or even ioctl_list. If you know that, you might possibly use the technique mentioned by Nemo (merge two small reads within the "optimal" block size into a single large read, needing no seek).
  4. Possibly even use query interfaces like FIEMAP / FIBMAP (the Windows equivalent would roughly be FSCTL_GET_RETRIEVAL_POINTERS) to determine where the physical blocks for your file data are, and perform a decision on read merging based on that (there's no point issuing a large "nonseeking" read if actually that crosses a physical block boundary and the filesystem turns it into two).
  5. If you build up the positions to read from over a comparatively long time, then issuing reads (asynchronously) while you're still computing future read offsets will also help to hide seek latency, as you're putting compute cycles / wait time to good use.
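
Note that preadv() itself fills several buffers starting from a single offset; for reads at many unrelated offsets, one way to hand the kernel a whole batch at once is POSIX AIO's lio_listio(). A rough sketch, with placeholder offsets, minimal error handling, and ignoring the fact that real code must split the batch to respect AIO_LISTIO_MAX (link with -lrt on older glibc):

```c
/* Submit one read per offset as a single batch; the kernel / driver is
 * then free to reorder and merge them. */
#define _FILE_OFFSET_BITS 64
#include <aio.h>
#include <stdlib.h>
#include <sys/types.h>

int batch_read(int fd, const off_t *offs, size_t n, unsigned char *out)
{
    struct aiocb *cbs   = calloc(n, sizeof(*cbs));
    struct aiocb **list = calloc(n, sizeof(*list));
    if (!cbs || !list) { free(cbs); free(list); return -1; }

    for (size_t i = 0; i < n; i++) {
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = &out[i];   /* one byte per request here */
        cbs[i].aio_nbytes     = 1;
        cbs[i].aio_offset     = offs[i];
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* LIO_WAIT: block until every request in the batch has completed. */
    int rc = lio_listio(LIO_WAIT, list, (int)n, NULL);

    free(list);
    free(cbs);
    return rc;
}
```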

In general, if none of the above applies, you'll have to bite the bullet and accept the seek latency. Buy a solid state disk and/or use a RAM-backed filesystem if you can justify the costs (and/or the volatility of RAM).

FrankH.
1

Have you tried mmap()ing the file (in your case, mmap64)? This will lazily read the data from disk as you access it.
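
Something like the following, as a sketch (it assumes a 64-bit build so the whole 12 GB fits in the address space; the path and offsets are placeholders):

```c
/* Map the big file read-only and pick the wanted bytes out of the
 * mapping; each first touch of a page becomes one disk read. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int extract_via_mmap(const char *path, const off_t *offs, size_t n,
                     unsigned char *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    unsigned char *map = mmap(NULL, (size_t)st.st_size, PROT_READ,
                              MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    /* Optional: tell the VM the access pattern is random. */
    madvise(map, (size_t)st.st_size, MADV_RANDOM);

    for (size_t i = 0; i < n; i++)
        out[i] = map[offs[i]];

    munmap(map, (size_t)st.st_size);
    close(fd);
    return 0;
}
```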

If you're having to seek through the entire file to find the data you're looking for, you'll be able to speed it up with an SSD, but it's always going to be slow. Are the locations of the data you're looking for known ahead of time?

Is the file a text file, or a binary file?

Dave
1

If you have to read the whole file and you are using a mechanical hard disk, you are screwed. Assume the transfer rate is about 1 gigabit/second; that means you physically can't get all the bits across the bus in less than 12 x 8 = 96 seconds. That assumes there is no seek time and that the processor can deal with the data as it comes in.

Since transfer rate is limited by the speed of the drive as much as anything, even if you know exactly where every byte of data you want to read is, if they are spread out randomly across the file, it'll still take about as long because you have to wait for the disk to rotate until the next byte you want is under the head.

If you have an SSD you can probably improve on this dramatically, since there's no waiting for the bytes to come round under the head...

JeremyP
  • But he is only trying to read 16k. The _size_ of the larger file is irrelevant; this process would take exactly as long if the file were 2 times, 10 times, or 100 times larger. – Nemo Jul 11 '11 at 15:46
  • @Nemo: Did you **read** the question? He says "Assume the file is on disk, and that **the bytes for the smaller file are somewhat randomly distributed in the large binary file**". If the 16k bytes were all in one place, I'd agree with you, it's one seek and then one read of 16 k. However, it's not and with a mechanical hard disk, you have to wait for the disk to rotate until the block containing each byte is under the read head. – JeremyP Jul 11 '11 at 15:49
  • Did *you* read the question? He is trying to read 16k by _seeking_ before each read. So he is _not_ trying to read the entire file; he is trying to read 16k total. So yes, the transfer rate is irrelevant, and his code would take exactly the same amount of time if the file were 12 terabytes. The only relevant number here is seek time, not transfer rate, so your answer as presently written is just wrong. – Nemo Jul 11 '11 at 15:55
  • @Nemo: Did you read the second paragraph of my answer where I say **Since transfer rate is limited by the speed of the drive as much as anything**. Do you understand what that means? It means, it makes no difference to how long it takes to read one byte and then another byte half way round the disk whether you read all the bytes in between or not. The disk doesn't speed up when it is not reading. – JeremyP Jul 11 '11 at 16:02
  • Yeah, OK. Plus your SSD suggestion is pretty good, so I have removed my downvote. But your first paragraph is still nonsense; his question makes it very clear he does not need to read the whole file, so that math is irrelevant. And it is also false that random reads will take "about as long" as reading the whole file. They are completely unrelated numbers. – Nemo Jul 11 '11 at 16:05
  • P.S. At least half of the random-access time consists in moving the disk head itself toward/away from the center of the disk; the other half is waiting for the disk to rotate. So even if the disk spun infinitely fast (rotational latency = zero), the seek time would still be measured in milliseconds. – Nemo Jul 11 '11 at 16:56
  • @Nemo: Thanks for removing the down vote. The first paragraph is there just to give an idea of the magnitude of the task he is trying to do. It's relevant in the sense that, one way or another, he is trying to traverse a file that is very large and he is probably limited by the physical characteristics of the device. – JeremyP Jul 12 '11 at 11:02
  • @Nemo: Oh, and I did think about discussing seek times which is probably a major contributor to his current measurement of 3 minutes, but I had a train to catch. – JeremyP Jul 12 '11 at 11:04
0

I suppose it depends on how many seeks you need to do. 16 thousand, or a smaller number? Can you store the 12 GB file on a solid state drive? That would cut down on the seek latencies.

Can you break up the file and store the pieces on separate hard drives? That would enable asynchronous seeking in parallel.

Mike Dunlavey
0

Some hints to speed up reading files a little (besides what has already been said):

  • Read chunks whose size is a multiple of the block size.
  • On POSIX-compliant systems, use posix_fadvise(), which advises the OS about paging.
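
A small sketch combining both hints, using st_blksize as the block-size hint and POSIX_FADV_WILLNEED to queue up readahead for the blocks that will be touched (the offsets are placeholders):

```c
/* Prefetch every block that will be touched, then do the actual reads;
 * by then the blocks should already be in the page cache. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int prefetch_and_read(int fd, const off_t *offs, size_t n, unsigned char *out)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    off_t bs = st.st_blksize;            /* preferred I/O block size */

    /* 1. Queue readahead for each block, aligned to the block size. */
    for (size_t i = 0; i < n; i++) {
        off_t start = (offs[i] / bs) * bs;
        posix_fadvise(fd, start, bs, POSIX_FADV_WILLNEED);
    }

    /* 2. Read the individual bytes. */
    for (size_t i = 0; i < n; i++)
        if (pread(fd, &out[i], 1, offs[i]) != 1)
            return -1;

    return 0;
}
```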

Maciej
0

Use parallel or asynchronous reads. Issue them from multiple threads, processes, etc. as necessary, or use preadv, just like FrankH said.

This means that you won't have to wait for one I/O request to complete before the next one can come along, which is going to improve performance if you have a clever RAID controller and lots of spindles.
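
As a rough sketch of the multi-threaded variant (the thread count, struct, and function names here are all placeholders; compile with -pthread):

```c
/* Split the offset list over a few threads, each issuing independent
 * pread() calls, so several requests are in flight at once. */
#define _FILE_OFFSET_BITS 64
#include <pthread.h>
#include <unistd.h>

#define NTHREADS 8

struct job {
    int fd;
    const off_t *offs;
    unsigned char *out;
    size_t begin, end;                  /* half-open index range */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    for (size_t i = j->begin; i < j->end; i++)
        pread(j->fd, &j->out[i], 1, j->offs[i]);
    return NULL;
}

int parallel_read(int fd, const off_t *offs, size_t n, unsigned char *out)
{
    pthread_t tid[NTHREADS];
    struct job job[NTHREADS];
    size_t per = (n + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        size_t begin = (size_t)t * per;
        size_t end   = begin + per > n ? n : begin + per;
        job[t] = (struct job){ fd, offs, out, begin, end };
        pthread_create(&tid[t], NULL, worker, &job[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```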

On the other hand, if you have a really stupid I/O subsystem, it may make only a minor difference. Consider which I/O scheduler to use (you can change them on the fly, without a reboot, which is really cool). Anecdotal evidence suggests "noop" is best if you have "smart" hardware, and cfq or deadline if you have stupid hardware.

MarkR