5

The task is to parse a binary file into memory. However, I don't know a priori the amount of memory needed to allocate.

Which method is generally preferable: doing multiple small mallocs as I progress through the parsing routine, or first traversing the file to determine how much memory is needed and then parsing it in a second pass?

Any hint is appreciated.

Carsos
    Did you try an exponential growth approach? – arunmoezhi Jul 25 '12 at 00:10
  • Trading off space inefficiency for the advantage of time efficiency, can't you initially allocate based on [stat()](http://linux.die.net/man/2/stat) or its equivalent, and avoid even an initial traversal? – Keith Flower Jul 25 '12 at 00:15
  • @KeithFlower Sounds like it's talking about the in-memory structure, not just the raw bytes. –  Jul 25 '12 at 00:15
  • @pst - hmmm, yeah, I assumed parsed-structure-in-memory <= file-size. May not necessarily be the case, I guess. – Keith Flower Jul 25 '12 at 00:18
  • I'm intrigued - you say you're parsing it into memory - do you mean parsing it into lots of smaller variably sized structures? If so, how would you allocate them in one malloc anyway? – ChrisH Jul 25 '12 at 00:43
  • Parsed structure in memory should be bounded by a simple (almost certainly linear) function of the file size. Unless that bound is really bad, I'd use that as the amount of memory to allocate. – R.. GitHub STOP HELPING ICE Jul 25 '12 at 00:44
  • @ChrisH: You have a point. However, in this particular task I'm filling an array of structs, and it's the size of this array that I don't know. Hence, I could do reallocs as the process goes, or first find out the size I need and then do just one malloc. Does this make sense to you? – Carsos Jul 25 '12 at 09:22
  • @R..: I could do that, but in this particular task the worst case scenario would be allocating 9*file-size bytes, and I don't see that as a good boundary. Would you think otherwise? – Carsos Jul 25 '12 at 09:25
  • I think it depends on how big files are. If they're at most a few tens or hundreds of kb, it might be no problem, especially if your system isn't under heavy load. If they could be hundreds of megs, probably not such a good idea... – R.. GitHub STOP HELPING ICE Jul 25 '12 at 15:20

5 Answers

6

In almost all cases one large allocation is better than many small allocations: it reduces fragmentation, requires fewer system calls, and often results in better performance through better locality.

A common technique is to allocate a small block first and grow it by a fixed factor (often 1.5) whenever it fills up. Once all elements have been gathered, the memory can be trimmed to the final size if the over-allocation is considered too big.
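A minimal sketch of that growth strategy in C; the `Record` struct and the function names are placeholders for whatever the parser actually produces:

```c
#include <stdlib.h>

/* Placeholder for the parsed element type. */
typedef struct {
    int    id;
    double value;
} Record;

typedef struct {
    Record *data;
    size_t  count;
    size_t  capacity;
} RecordArray;

/* Append one record, growing the buffer by ~1.5x when it is full.
   Returns 0 on success, -1 on allocation failure. */
static int record_array_push(RecordArray *a, const Record *r)
{
    if (a->count == a->capacity) {
        size_t new_cap = a->capacity ? a->capacity + a->capacity / 2 : 16;
        Record *p = realloc(a->data, new_cap * sizeof *p);
        if (!p)
            return -1;
        a->data = p;
        a->capacity = new_cap;
    }
    a->data[a->count++] = *r;
    return 0;
}

/* Optionally give back the over-allocation once parsing is done. */
static void record_array_trim(RecordArray *a)
{
    if (a->count && a->count < a->capacity) {
        Record *p = realloc(a->data, a->count * sizeof *p);
        if (p) {
            a->data     = p;
            a->capacity = a->count;
        }
    }
}
```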

In any case: implement the simplest approach first. If you run into performance problems, benchmark, then optimize. It might turn out that allocation isn't even your bottleneck.

EDIT: As R.. mentions, you can often get a good idea of how much to allocate by reasoning about an upper bound on memory use as a function of the file length. Most good binary formats also contain length and size information in a header segment. If you can figure out the exact size required by your data structure with a little arithmetic and/or file seeking, you are on the winning side.
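As an illustration of the header-driven variant, here is a hedged sketch that assumes (purely hypothetically) the file begins with a 32-bit record count in host byte order, and reuses the placeholder `Record` type from the sketch above:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed layout, for illustration only: a 32-bit record count at the
   start of the file. `Record` is the placeholder type defined in the
   previous sketch. */
static int parse_with_one_allocation(FILE *f, Record **out, uint32_t *n)
{
    uint32_t count;
    if (fread(&count, sizeof count, 1, f) != 1)
        return -1;

    /* One allocation sized from the header instead of many small ones. */
    Record *records = malloc((size_t)count * sizeof *records);
    if (records == NULL && count > 0)
        return -1;

    for (uint32_t i = 0; i < count; i++) {
        /* ... read and decode record i from `f` into records[i] ... */
    }

    *out = records;
    *n   = count;
    return 0;
}
```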

pmr
  • Reallocation can be a pain if the structures contain pointers to one another - all the pointers will be invalidated if the allocation is moved to a new address. Moreover, it's technically even UB to do the arithmetic to fix them up after relocation. I would think it makes more sense just to `malloc` space for the additional data, rather than calling `realloc` on the original block, and possibly having all the allocated blocks accounted for as a linked list starting at the beginning of the first one so you can easily free them later. – R.. GitHub STOP HELPING ICE Jul 25 '12 at 00:47
  • @R.. Yes, all of this gets a lot harder if you have referential data. As I said: Go for the simplest thing first, optimize later. – pmr Jul 25 '12 at 00:52
  • I think the simplest solution is to make some observation like that the size of the in-memory representation is bounded by 3 times the file length, then allocate that much memory. :-) If you want to be able to free the unused part once you're done, use `mmap` to get anonymous pages; then you can `munmap` the part you don't need (whereas you cannot use `realloc` to shrink an allocation without risking moving it to a new address). See the sketch after these comments. – R.. GitHub STOP HELPING ICE Jul 25 '12 at 01:00
  • By the way, the reason I've thought of these things is from the standpoint of how one would go about designing a web browser that doesn't suck. A single allocation for a web page (with additional allocations for later DOM modifications) would make a huge difference to memory usage/fragmentation/contention compared to typical current browsers that call `malloc` thousands, tens of thousands, or hundreds of thousands of times per page load. – R.. GitHub STOP HELPING ICE Jul 25 '12 at 01:04
  • @R.. I tried integrating your thoughts into the answer. I hope I did them justice. – pmr Jul 25 '12 at 01:06
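A rough sketch of the anonymous-mapping idea from R..'s comment above, assuming a POSIX system that provides `MAP_ANONYMOUS` (Linux/BSD; with glibc you may need `-D_DEFAULT_SOURCE`). The upper bound passed in, e.g. some multiple of the file size, is the caller's guess:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve an upper-bound-sized anonymous mapping up front. Unlike
   realloc(), trimming it later never moves the data, so internal
   pointers stay valid. */
void *alloc_bounded(size_t max_bytes)
{
    void *p = mmap(NULL, max_bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Give back the unused tail (rounded up to a page boundary) once the
   real size is known. */
void trim_tail(void *base, size_t max_bytes, size_t used_bytes)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t keep = (used_bytes + page - 1) & ~(page - 1);
    if (keep < max_bytes)
        munmap((char *)base + keep, max_bytes - keep);
}
```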
3

Have you looked at maybe using mmap() for this? See this link for more info. Basically you just map your file into memory and access it as if it were one big memory block, avoiding malloc()s altogether.
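A minimal POSIX sketch of what this looks like, with error handling trimmed to the essentials:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Map the whole file read-only; the OS pages it in on demand. */
    const unsigned char *data =
        mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Parse directly out of data[0 .. st.st_size - 1] here. */

    munmap((void *)data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```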

YePhIcK
  • Interesting, I didn't know that existed. But I'm afraid I can't use it here, for I have to parse the file, not simply load it. Nevertheless, I learned something. Thanks – Carsos Jul 25 '12 at 00:16
  • 2
    +1 for mmap, the most reasonable solution. @Carsos: It works perfectly in your case: you avoid writing a lot of code and worrying about memory (the OS will read, map into virtual memory, prefetch, etc.), and you can simply parse it as if you were using smaller buffers. –  Jul 25 '12 at 00:19
  • @VladLazarenko: The parsing is simple, and the input file is small, but it's the amount of information I can extract that's (relatively) huge and (needs?) care concerning mallocs. – Carsos Jul 25 '12 at 09:29
  • With mmap there are no mallocs. Have you considered reading the docs? They even have examples... –  Jul 25 '12 at 11:51
  • @VladLazarenko: Thanks for keeping up, but I fail to see why you insist on `mmap`. My input's format is "linear", i.e. once I parse some unit, I need only move forward to keep parsing. Make sense? So my problem is how much memory to allocate (and I do need to allocate) to hold the information extracted from my input. Regards – Carsos Jul 25 '12 at 12:56
  • The use of `mmap()` eliminates the need for memory management (in your case) altogether. Whether it is "read once" or "random access", `mmap()` will work the same for you, and the OS will take care of not wasting the memory used for the operation. In fact, on most systems the use of `mmap()` for reading a file lets the OS "just" treat the file as a region of memory, utilizing the same routines that deal with paging (at the application level you can't do any better than that!) – YePhIcK Sep 25 '16 at 06:54
  • I think @Carsos made clear that malloc() is not being called to store parts of the file but to store the results of parsing, whose size is unknown. Say the file contains a graph with n vertices and you want all cliques (for whatever reason): parsing the file is trivial, but you may need 2^n small allocations, one per clique (in the case of the complete graph). For this, mmap is useless, and the idea of allocating a big block is actually useful to reduce the number of calls to malloc (assuming you don't want to free() cliques individually). – santileortiz Jan 02 '18 at 07:16
2

This is a classic time-space tradeoff. Allocating lots of small blocks is likely to be less efficient than one big block, assuming you need the entire contents.

Ideally, the file format should encode metadata such as the size of blocks, the count of chunks, and so on. Given the latency of disk access compared to the speed of memory, reading through the file twice just to determine the required size would likely cost more than it saves.

The most efficient approach also depends on how much processing is required. You mention parsing, but it is a binary file. Presumably there are many chunks and variable-sized structures you need to traverse?

There are a few strategies you can try:

  • If the files are not too large to fit in memory, you could query the filesystem to see how big the file is, read it in as one big chunk, then pull it apart in memory (see the sketch after this list). This would be very fast, but use lots of memory.

  • Depending on the structure of the binary file, you might be able to do a few fseek() calls to figure out how big the chunks you need to read are (if you don't need the entire file) and just read those.

  • You could use mmap() to map the file into memory and let the runtime manage paging of the data into memory.
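As a sketch of the first strategy, assuming a POSIX system (the `slurp_file` name is made up for the example):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Read the whole file into a single heap buffer sized via stat().
   The caller frees the returned buffer; *len receives the byte count.
   Returns NULL on error (including an empty file, for simplicity). */
unsigned char *slurp_file(const char *path, size_t *len)
{
    struct stat st;
    if (stat(path, &st) < 0 || st.st_size <= 0)
        return NULL;

    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    unsigned char *buf = malloc((size_t)st.st_size);
    if (buf && fread(buf, 1, (size_t)st.st_size, f) != (size_t)st.st_size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf)
        *len = (size_t)st.st_size;
    return buf;
}
```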

gavinb
  • Oh and I would like to echo what pmr and Gene said - implement the simplest and cleanest method first, and then see what sort of performance you get. Premature optimisation... – gavinb Jul 25 '12 at 00:21
1

Traversing the file in order to determine its size and the amount of memory you need is definitely not the way to go; disk I/O is extremely expensive.

Another option would be to get the file size and then allocate memory. For details on how to get a file size, see this Q/A. However, this method is not efficient either.

All in all, it actually depends on how you read the data and how you parse it. For example, having a few reasonably large chunks of data along with asynchronous file I/O might work out best for you. But that is a relatively complex task to implement.

Probably the easiest and very efficient thing to start with would be to use mmap and "map" the contents of a file into memory.

Community
1

There is no general answer, at least in part because you don't define "preferable." Simplest? Fastest? Requires the least heap? Also, what do you mean by "parse a binary file"? Parsing is normally something done to human-readable text in order to create a data structure.

Each malloc normally has a small overhead. However unless the final data structure is huge, it's unlikely to make any significant difference.

Do what produces the clearest code, with clean interfaces so you can substitute allocation methods later. Then worry about optimizing only after you know there's a problem.

Gene