15

I am using Python multiprocessing to generate a temporary output file per process. They can be several GBs in size and I make several tens of these. These temporary files need to be concatenated to form the desired output, and this is the step that is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata rather than actually copying the content? As long as it works on any Linux system that would be acceptable to me, but a filesystem-specific solution won't be of much help.

I am not OS or CS trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to concatenate, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities I fully expected there to be one, but could not find anything. Hence my question on SO. The file system is on a block device, a hard disk actually, in case this information matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) will be very helpful.

san
  • 4,144
  • 6
  • 32
  • 50
  • @san: By way of background, please could you say a few words about why the final output *has* to be a single file. – NPE May 05 '11 at 06:40
  • @aix It's input to another piece of code that I do not have control over. – san May 05 '11 at 06:41
  • @san: Do you know in advance the size of each temporary file? – NPE May 05 '11 at 06:42
  • @san: can't you just provide `cat file1 ... fileN |` on stdin of the next process instead of a regular file? – Marc Mutz - mmutz May 05 '11 at 06:48
  • @aix No, I don't. It's not a fixed number for each process either. I might be willing to take a risky guess at an upper bound, but if there is a way to do it without one, that would be nice. – san May 05 '11 at 06:49
  • @mmutz Would it be any faster than calling cat on those files via system()? That's what I do now, rather than copying/concatenating the files' contents from inside Python. The program does not read from stdin, but I think I can create a named pipe to deal with that. – san May 05 '11 at 06:51
  • @san: yes, because it would produce the input stream on-the-fly. E.g. you would not need twice the storage capacity to hold the temporaries _and_ the final file. – Marc Mutz - mmutz May 05 '11 at 06:54
  • @mmutz Ah, I see. A generator by the OS. This looks like it will work. I have to get the named pipe working; the buffering issues mess it up sometimes. But thanks. Could you add this to your answer? – san May 05 '11 at 06:57

6 Answers

15

Even if there were such a tool, it could only work if all the files except the last were guaranteed to have a size that is a multiple of the filesystem's block size.

If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:

  1. Before starting the multiprocessing, create the final output file and grow it to its final size by seeking to the end and writing a byte there (or by calling `ftruncate()`); this creates a sparse file.

  2. Start multiprocessing, handing each process the FD and the offset into its particular slice of the file.

This way, the processes will collaboratively fill the single output file, removing the need to cat them together later.
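
For illustration, here is a rough Python sketch of this approach, assuming the size each worker will produce is known up front (the file name, sizes and dummy payload below are placeholders, not code from the answer):

    import multiprocessing

    OUTPUT = "final.out"                                      # placeholder name
    SIZES = [10 * 1024 ** 2, 20 * 1024 ** 2, 15 * 1024 ** 2]  # bytes per worker (placeholders)

    def worker(offset, size):
        # Each worker opens the shared output file and writes only inside its own slice.
        with open(OUTPUT, "r+b") as f:
            f.seek(offset)
            remaining = size
            while remaining:                                  # dummy payload; replace with real data
                chunk = min(remaining, 1024 ** 2)
                f.write(b"x" * chunk)
                remaining -= chunk

    if __name__ == "__main__":
        # Create the file and extend it to its final size up front; on Linux this
        # gives a sparse file, so no data blocks are allocated until written to.
        with open(OUTPUT, "wb") as f:
            f.truncate(sum(SIZES))

        offsets = [sum(SIZES[:i]) for i in range(len(SIZES))]
        procs = [multiprocessing.Process(target=worker, args=(off, size))
                 for off, size in zip(offsets, SIZES)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()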

EDIT

If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed `cat tmpfile1 ... tmpfileN` to the consumer, either on stdin:

cat tmpfile1 ... tmpfileN | consumer

or via named pipes (using bash's Process Substitution):

consumer <(cat tmpfile1 ... tmpfileN)
Marc Mutz - mmutz
  • 24,485
  • 12
  • 80
  • 90
  • +1 been awhile since I've seen anyone reference sparse files. – dietbuddha May 05 '11 at 06:44
  • Added a comment about this a moment ago. My main problem is (i) I do not have a guaranteed upper bound on how much each process will write and (ii) I don't know how the consumer of this file will deal with the holes. If there were a way to detect and remove holes, it might still be worth a shot. – san May 05 '11 at 06:46
5

You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that would present the chunks as a single large file, while keeping them as individual files on the underlying filesystem.

In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.

FUSE has bindings for a bunch of languages, including Python. If you look at some examples here or here (these are for different bindings), this requires surprisingly little code.
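
As an illustration (not taken from the linked examples), a minimal read-only sketch using the fusepy bindings could look roughly like this; the class name, the virtual file name `joined` and the command-line handling are all made up:

    # Presents several chunk files as one virtual file, <mountpoint>/joined.
    # Requires the fusepy package (pip install fusepy).
    import errno
    import os
    import stat
    import sys

    from fuse import FUSE, FuseOSError, Operations

    class ConcatFS(Operations):
        def __init__(self, chunks):
            self.chunks = chunks                              # temp files, in order
            self.sizes = [os.path.getsize(c) for c in chunks]
            self.total = sum(self.sizes)

        def getattr(self, path, fh=None):
            if path == '/':
                return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
            if path == '/joined':
                return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=self.total)
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return ['.', '..', 'joined']

        def read(self, path, size, offset, fh):
            # Map the requested (offset, size) range onto the underlying chunks.
            parts = []
            for chunk, chunk_size in zip(self.chunks, self.sizes):
                if offset >= chunk_size:
                    offset -= chunk_size
                    continue
                with open(chunk, 'rb') as f:
                    f.seek(offset)
                    data = f.read(min(size, chunk_size - offset))
                parts.append(data)
                size -= len(data)
                offset = 0
                if size <= 0:
                    break
            return b''.join(parts)

    if __name__ == '__main__':
        # usage: python concatfs.py <mountpoint> <chunk1> <chunk2> ...
        FUSE(ConcatFS(sys.argv[2:]), sys.argv[1], foreground=True)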

NPE
  • 486,780
  • 108
  • 951
  • 1,012
  • Thanks. I did not know about FUSE before. Hope the overheads are not too high; if it can read 200 GB in around 3 minutes I will be happy. – san May 05 '11 at 07:01
  • 1
    @san: FUSE or no FUSE, I'd expect the reading to be disk-bound, so I doubt you'd see any difference in reading performance. – NPE May 05 '11 at 07:09
  • Thanks (and indeed the concatenation that I was doing was I/O bound). Wish there was a way to accept two answers because I like the FUSE and the named pipe solution. I upvoted both, but I can accept only one. – san May 05 '11 at 07:20
3

For 4 files (xaa, xab, xac, xad), a fast concatenation in bash (as root):

losetup -v -f xaa; losetup -v -f xab; losetup -v -f xac; losetup -v -f xad

(Let's suppose that loop0, loop1, loop2, loop3 are the names of the new device files.)

Put http://pastebin.com/PtEDQH7G into a "join_us" script file. Then you can use it like this:

./join_us /dev/loop{0..3}

Then (if this big file is a film) you can give its ownership to a normal user (`chown itsme /dev/mapper/joined`), who can then play it via `mplayer /dev/mapper/joined`.
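
The pastebin script itself is not reproduced here; roughly, the idea is to build a device-mapper "linear" table over the loop devices. A hypothetical Python equivalent (run as root; `blockdev --getsz` reports each device's size in 512-byte sectors, and the table is fed to `dmsetup create` on stdin; the actual script may differ):

    import subprocess
    import sys

    def join_devices(devices, name="joined"):
        # Build one device-mapper "linear" table line per device, back to back.
        start = 0
        lines = []
        for dev in devices:
            sectors = int(subprocess.check_output(["blockdev", "--getsz", dev]))
            lines.append("%d %d linear %s 0" % (start, sectors, dev))
            start += sectors
        table = "\n".join(lines) + "\n"
        # dmsetup reads the table from stdin and creates /dev/mapper/<name>.
        subprocess.run(["dmsetup", "create", name], input=table.encode(), check=True)

    if __name__ == "__main__":
        join_devices(sys.argv[1:])   # e.g. python join_us.py /dev/loop0 /dev/loop1 ...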

The cleanup after these (as root):

dmsetup remove joined; losetup -d /dev/loop[0123]
szabozoltan
  • 629
  • 8
  • 13
2

I don't think so; file data may be inode-aligned, so it may only be possible if you are OK with leaving some zeros (or unknown bytes) between one file's footer and the next file's header.

Instead of concatenating these files, I'd suggest re-designing the analysis tool to support sourcing from multiple files. Take log files, for example: many log analyzers support reading one log file per day.

EDIT

@san: Since, as you say, you can't control the code that consumes the file, you can still concatenate the separate files on the fly by using a named pipe:

$ mkfifo /tmp/cat
$ cat file1 file2 ... >/tmp/cat &
$ user_program /tmp/cat
...
$ rm /tmp/cat
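
Driven from Python, the same idea might look roughly like this (a sketch only; `consumer` and the temp file names are placeholders for the real program and files):

    import os
    import shlex
    import subprocess
    import tempfile

    tmpfiles = ["tmp1.dat", "tmp2.dat", "tmp3.dat"]           # per-process outputs (placeholders)
    fifo = os.path.join(tempfile.mkdtemp(), "joined.fifo")
    os.mkfifo(fifo)

    # Let a shell child open the FIFO for writing, so this process never blocks in open().
    cat = subprocess.Popen("cat %s > %s" % (" ".join(map(shlex.quote, tmpfiles)),
                                            shlex.quote(fifo)), shell=True)

    # The consumer reads the FIFO as if it were a regular, sequential-only file.
    subprocess.run(["consumer", fifo], check=True)

    cat.wait()
    os.remove(fifo)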
Lenik
  • 13,946
  • 17
  • 75
  • 103
  • Could you elaborate a little bit on the inode alignment problem? – san May 05 '11 at 06:53
  • 1
    A file may be, say, 12345 bytes long; if the inode size (you can get it with `sudo tune2fs -l /dev/sda1`) is 4096 bytes, then the file occupies 4 inodes. 4 inodes is 16384 bytes in all, so the last inode is not completely filled, which leaves `4096*4 - 12345 = 4039` unknown bytes at the end. – Lenik May 05 '11 at 07:08
  • 2
    the named FIFO seems to be the most convenient solution we are converging on; @mmutz suggested it on this thread. But thanks for the explanation about inodes. You meant block, right? Because IIRC (where r=read) each file corresponds to a single inode, but the inode pointer structure refers to multiple blocks on the disk. – san May 05 '11 at 07:15
  • Yes, my mistake, I mean block, not inode! O.O – Lenik May 05 '11 at 07:38
0

No, there is no such tool or syscall.

You might investigate if it's possible for each process to write directly into the final file. Say process 1 writes bytes 0-X, process 2 writes X-2X and so on.

janneb
  • 36,249
  • 2
  • 81
  • 97
  • Yes, I did think about this; that is, I open a single file, seek to different offsets, and write there. I am not sure how the consumer of this file will deal with the holes that are created. Also, I do not have a guaranteed upper bound on how much each process will write. – san May 05 '11 at 06:43
0

A potential alternative is to cat all your temp files into a named pipe and then use that named pipe as input to your single-input program, as long as that program just reads its input sequentially and doesn't seek.

Ryan C. Thompson
  • 40,856
  • 28
  • 97
  • 159