
My program needs to read chunks from a huge binary file with random access. I have a list of offsets and lengths which may have several thousand entries. The user selects an entry, and the program seeks to the offset and reads length bytes.

The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:

FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
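
For context, the full routine looks roughly like this (a minimal sketch; the procedure name and the Clear call are assumptions, and the two streams are created once and reused):

// sketch of the read routine described above; names are assumptions
procedure ReadChunk(FileStream: TFileStream; MemoryStream: TMemoryStream;
  Offset: Int64; Size: Integer);
begin
  MemoryStream.Clear;                       // reuse the same stream per read
  FileStream.Position := Offset;            // seek to the selected entry
  MemoryStream.CopyFrom(FileStream, Size);  // read Size bytes into memory
end;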

This works fine, but unfortunately it becomes increasingly slow as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 KB in size.

The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.

I am using Delphi 2007 on a Windows XP box.

What can I do to speed up this file access?

edit:

  • The file access is slow for large files, regardless of which part of the file is being read.
  • The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
  • It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
  • I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.

I need to apologize to everybody: after I implemented file access using a memory mapped file as suggested, it turned out that it did not make much of a difference. But it also turned out, after I added some more timing code, that it is not the file access that slows down the program. The file access actually takes nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data, and somehow I failed to see the difference when I first timed the processes.

I am sorry for being sloppy in identifying the bottleneck.

dummzeuch
    I don't see anything obvious. These stream classes are just wrappers around the system file I/O functions. How could you improve significantly on those for random access patterns? – David Heffernan Jan 06 '11 at 13:25
  • Are you saying that a single seek/read is noticeably slower to the user? Or that a large "batch" of these operations is slower? A single seek and read operation should be about the same regardless of file size if the data is coming from the disk. With a 7200 rpm disk, that should be maybe between 5 and 10 ms. – Mark Wilkins Jan 06 '11 at 14:30
  • It might be a memory fragmentation problem. Are you freeing the TMemoryStream between operations? Try keeping it alive for the life of the application and see if the slow-down disappears. – Cosmin Prund Jan 06 '11 at 14:36
  • @mark: It's the single operation of reading a chunk from a file that is much slower for large files than for smaller ones. (added this info to the question) – dummzeuch Jan 06 '11 at 14:56
  • @Cosmin: It does not get slower as the program runs; it is slow from the start for large files. – dummzeuch Jan 06 '11 at 14:57
  • This sounds like an API issue; I'd try to get this answered by http://blogs.msdn.com/b/oldnewthing/ – Jeroen Wiert Pluimers Jan 06 '11 at 19:14

3 Answers


If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can experiment with them to gain some performance.
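
In Delphi that could look roughly like this (a sketch, untested; the wrapper function is my own, and FILE_FLAG_NO_BUFFERING is left out because it additionally requires sector-aligned offsets and sizes):

uses Windows, SysUtils, Classes;

// Open the file for random access and wrap the handle in a stream so the
// rest of the code can keep using TStream-based reads. Note that
// THandleStream does not close the handle; call CloseHandle yourself.
function OpenForRandomAccess(const FileName: string): THandleStream;
var
  H: THandle;
begin
  H := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, 0);
  if H = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  Result := THandleStream.Create(H);
end;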

Next, copying the file data, even just 100 KB, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid the copy and possibly get certain performance benefits (but you need to measure speed carefully).
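
A rough sketch of what the mapping side could look like in Delphi (names and error handling are mine; a view offset must be a multiple of the system allocation granularity, typically 64 KB, hence the alignment arithmetic):

uses Windows, SysUtils;

// Maps just the region around one chunk and returns a pointer to the chunk
// itself; the caller unmaps the view and closes the mapping when done.
function MapChunk(FileHandle: THandle; ChunkOffset: Int64; ChunkSize: Integer;
  out Mapping: THandle; out View: Pointer): Pointer;
var
  SysInfo: TSystemInfo;
  AlignedOffset: Int64;
  Delta: Integer;
begin
  GetSystemInfo(SysInfo);
  // align the view start down to the allocation granularity
  AlignedOffset := (ChunkOffset div SysInfo.dwAllocationGranularity)
    * SysInfo.dwAllocationGranularity;
  Delta := Integer(ChunkOffset - AlignedOffset);

  Mapping := CreateFileMapping(FileHandle, nil, PAGE_READONLY, 0, 0, nil);
  if Mapping = 0 then
    RaiseLastOSError;
  View := MapViewOfFile(Mapping, FILE_MAP_READ,
    Int64Rec(AlignedOffset).Hi, Int64Rec(AlignedOffset).Lo, Delta + ChunkSize);
  if View = nil then
    RaiseLastOSError;
  Result := Pointer(PAnsiChar(View) + Delta);  // points at the chunk, no copy
end;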

Eugene Mayevski 'Callback
  • Since copying the data happens regardless of the file size, it cannot be the bottleneck. – dummzeuch Jan 06 '11 at 14:19
  • @dummzeuch who said it does? You get a pointer to the mapped memory. You don't need to copy anything that way and can access the mapped memory directly. MMF saves at least one read. – Eugene Mayevski 'Callback Jan 06 '11 at 15:49
  • Agreed. MMF is much faster than using plain file I/O. In one of my projects, I have to open binary log files that can be up to several GBs in size. Seeking through such files using random file I/O can take minutes, whereas doing the same job with MMF works in a fraction of the time. – Remy Lebeau Jan 06 '11 at 21:14
  • Hm, so you both are saying that using a memory mapped file is always faster than normal file I/O? If that's true, it should also be possible to copy that memory to my memory stream and still have that speed advantage, so I don't have to change too much to test it out. – dummzeuch Jan 07 '11 at 07:41
  • @dummzeuch let me repeat again - don't copy the memory to a memory stream! It is possible to construct a memory stream that points at an existing memory block by setting the pointer and size properties of the TMemoryStream object. – Eugene Mayevski 'Callback Jan 07 '11 at 08:03
  • @Eugene: Yes, I understood what you mean. But there is some more code in the program that splits the data chunk it has read from the file into two parts and inserts some additional data in between, so it can't just work on the memory into which the file has been mapped. (But see my latest edit regarding the real problem with the program.) – dummzeuch Jan 07 '11 at 10:23
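
The no-copy TMemoryStream trick Eugene describes in the comments would look roughly like this (a sketch; TCustomMemoryStream's protected SetPointer is exposed through a small read-only descendant, and the class name is an assumption):

uses Classes;

type
  // A read-only stream that simply points at an existing block of memory
  // (for example a mapped view) instead of copying it.
  TPointerStream = class(TCustomMemoryStream)
  public
    constructor Create(Data: Pointer; Size: Longint);
  end;

constructor TPointerStream.Create(Data: Pointer; Size: Longint);
begin
  inherited Create;
  SetPointer(Data, Size);  // protected in TCustomMemoryStream
end;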

Maybe you can take this approach:

Sort the entries on file position and then do the following:

  1. Take the entries that only need the first X MB of the file (up to a certain file position)
  2. Read X MB from the file into a buffer (TMemoryStream)
  3. Now read the entries from the buffer (maybe multithreaded)
  4. Repeat this for all the entries.

In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.

Maybe you can gain speed if you just take your original approach, but sort the entries on position.
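
A rough sketch of the windowed variant (the record type, window size, and the ProcessChunk placeholder are assumptions; entries must already be sorted by offset, and each chunk is assumed to be much smaller than the window):

uses Classes, Math;

type
  TChunkEntry = record
    Offset: Int64;
    Size: Integer;
  end;
  TChunkEntries = array of TChunkEntry;

procedure ProcessChunk(Data: Pointer; Size: Integer);
begin
  // placeholder for whatever the program does with one chunk
end;

procedure ProcessAllChunks(FileStream: TFileStream; const Entries: TChunkEntries);
const
  WindowSize = 64 * 1024 * 1024;  // cache 64 MB of the file at a time
var
  Buffer: TMemoryStream;
  WindowStart: Int64;
  i: Integer;
begin
  Buffer := TMemoryStream.Create;
  try
    i := 0;
    while i < Length(Entries) do
    begin
      // steps 1+2: read one window of the file, starting at the next entry
      WindowStart := Entries[i].Offset;
      FileStream.Position := WindowStart;
      Buffer.Clear;
      Buffer.CopyFrom(FileStream, Min(WindowSize, FileStream.Size - WindowStart));
      // step 3: serve every entry that lies completely inside this window
      while (i < Length(Entries)) and
            (Entries[i].Offset + Entries[i].Size <= WindowStart + Buffer.Size) do
      begin
        ProcessChunk(PAnsiChar(Buffer.Memory) +
          Integer(Entries[i].Offset - WindowStart), Entries[i].Size);
        Inc(i);
      end;
    end;
  finally
    Buffer.Free;
  end;
end;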

The_Fox

The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.

The stock Delphi TFileStream is also not the most efficient component. Way back, Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.

Good luck.

Terry