
I'm creating an archiver, which processes data in two steps:

  1. creates a temporary archive file
  2. from the temporary archive file, it creates the final archive; after the final archive is created, the temporary file is deleted

The 2nd step processes the temporary archive file linearly, and the result is written to the final archive while processing. So this process temporarily needs twice as much storage as the archive file.

I'd like to avoid the double storage need. So my idea is that, during processing, I'd tell the OS that it can drop the already-processed part of the temporary file. Like a truncate call, but one that truncates the file at the beginning, not the end. Is it possible to do something like this?

geza
  • Do you really need the temporary file? Why not directly process the output of the first step? Standard Unix/Linux way is to have small programs which do one thing, and chain them: `tar c ... | gzip > archive.tar.gz` – Karsten Koop Dec 11 '17 at 13:32
  • @KarstenKoop: yes, because this process cannot be done in one step, nor with pipes. The main reason is that I'd like to put a table-of-contents (which is compressed as well) at the beginning of the file. – geza Dec 11 '17 at 13:36
  • What kind of archive file? What is "this process"? Without knowing what you're trying to do, nor your starting and desired ending states, it's really hard to answer your question. – Andrew Henle Dec 11 '17 at 13:44
  • @AndrewHenle: it's a new format. Why does the process matter (it's complicated to describe)? I'm reading the tmp file linearly (from the beginning to the end), and I'd like to have the already-read part discarded by the OS. Just as if I were reading the file backwards, in which case I could call `ftruncate` with smaller and smaller sizes. – geza Dec 11 '17 at 13:50
  • Maybe you want some [SQLite](http://sqlite.org/) database, or you want to keep all the data (if not too big) in memory. But your question should mention the approximate data size and its form. Handling petabytes is not the same as handling a few megabytes. Explain why you want to avoid the double storage need (disk space is really cheap today). – Basile Starynkevitch Dec 11 '17 at 15:01
  • I've found a detailed write-up on the topic for the CLI case: http://backreference.org/2011/01/29/in-place-editing-of-files/ – Velkan Dec 11 '17 at 15:06
  • https://stackoverflow.com/questions/18072180/truncating-the-first-100mb-of-a-file-in-linux – William Pursell Dec 11 '17 at 15:07

2 Answers


Write all the data first. Then shift it by opening the file twice: once for reading and once for writing in overwrite mode (inject the table of contents, and make sure you never overwrite a chunk before it has been read).
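
A minimal sketch of that back-to-front shift, using a single descriptor with `pread(2)`/`pwrite(2)` instead of two separate opens; the file name, the TOC contents, and the chunk size are placeholder assumptions:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical TOC and file name, for illustration only. */
    const char toc[] = "...table of contents...";
    const off_t toc_len = (off_t)sizeof toc;

    int fd = open("archive.tmp", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t size = lseek(fd, 0, SEEK_END);
    char buf[1 << 16];

    /* Copy chunks starting from the tail, so every source region is
       read before the shifted copy can clobber it. */
    for (off_t pos = size; pos > 0; ) {
        size_t n = pos < (off_t)sizeof buf ? (size_t)pos : sizeof buf;
        pos -= (off_t)n;
        if (pread(fd, buf, n, pos) != (ssize_t)n ||
            pwrite(fd, buf, n, pos + toc_len) != (ssize_t)n) {
            perror("shift"); return 1;
        }
    }
    /* The first toc_len bytes are now free; drop the TOC there. */
    if (pwrite(fd, toc, (size_t)toc_len, 0) != (ssize_t)toc_len)
        perror("pwrite toc");
    close(fd);
    return 0;
}
```

Iterating from the tail is what guarantees the "don't overwrite before reading" invariant: each destination range lies strictly above its source range, and everything above has already been copied.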

If the table of contents has a fixed length, then preallocate that space in the file to avoid the shifting completely.

Velkan
  • Good idea, but unfortunately it is complicated to do in my case (the processed data can be a little larger than the unprocessed data, and the TOC itself takes space, so I'd have to make sure that I never overwrite data that is still needed). The TOC has a variable length (it's compressed), so unfortunately it cannot be preallocated. – geza Dec 11 '17 at 16:58

Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?

No, that is not possible with plain files in a portable way. However, look into the Linux-specific fallocate(2): its FALLOC_FL_PUNCH_HOLE flag deallocates a given range (the file size stays the same, but the underlying blocks are freed), and FALLOC_FL_COLLAPSE_RANGE removes a range from the file entirely. It is not portable and might not work with every file system, so I don't recommend relying on it.
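
The punch-hole variant maps directly onto the question's linear pass. A minimal sketch, assuming Linux with glibc and a file system such as ext4 or XFS that supports FALLOC_FL_PUNCH_HOLE; the file name and chunk size are placeholders:

```c
#define _GNU_SOURCE           /* for fallocate(2) and FALLOC_FL_* */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("archive.tmp", O_RDWR);   /* hypothetical temp file */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[1 << 20];               /* 1 MiB per step */
    off_t done = 0;                         /* bytes already processed */
    ssize_t n;

    while ((n = pread(fd, buf, sizeof buf, done)) > 0) {
        /* ... process buf[0..n) and append to the final archive ... */
        done += n;
        /* Deallocate everything consumed so far: the file's size is
           unchanged, but the freed blocks go back to the file system. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, done) < 0)
            perror("fallocate");            /* e.g. EOPNOTSUPP */
    }
    close(fd);
    return 0;
}
```

Note that hole punching works at file-system block granularity: within the punched range, partial blocks are only zeroed, and whole blocks are deallocated, so a trailing partial block is freed once a later punch covers it fully.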

Alternatively, look into SQLite and GDBM indexed files. They provide an abstraction (above plain files) which enables you to "delete records".

Or just keep all the data temporarily in memory.

Or consider a two-pass (or multiple-pass) approach. Maybe nftw(3) could be useful.

(Today, disk space is very cheap, so your requirement to avoid the double storage need is really strange; if you are handling a huge amount of data, you should have mentioned that.)

Basile Starynkevitch