7

I have a program that reads and writes very large text files. However, because of the format of these files (they are ASCII representations of what should have been binary data), these files are actually very easily compressed. For example, some of these files are over 10GB in size, but gzip achieves 95% compression.

I can't modify the program but disk space is precious, so I need to set up a way that it can read and write these files while they're being transparently compressed and decompressed.

The program can only read and write files, so as far as I understand, I need to set up a named pipe for both input and output. Some people are suggesting a compressed filesystem instead, which seems like it would work, too. How do I make either work?

Technical information: I'm on a modern Linux. The program reads a separate input and output file. It reads through the input file in order, though twice. It writes the output file in order.

A. Rex
  • Feel free to edit my tags. I found it very difficult to choose appropriate ones. Also, if this is a duplicate, as always, let me know and I'll be happy to delete ... – A. Rex Apr 16 '09 at 07:57
  • This isn't programming related, as you can't change your program. You either need bigger disks or a r/w compressed file system. – Alnitak Apr 16 '09 at 08:30

5 Answers

5

Check out zlibc: http://zlibc.linux.lu/.

Also, if FUSE is an option (i.e. the kernel is not too old), consider: compFUSEd http://www.biggerbytes.be/

EFraim
  • Can I write with zlibc, too? It's as crucial that I can write as read. – A. Rex Apr 16 '09 at 08:13
  • zlibc is mainly for writing new programs that compress, and you said you couldn't touch your program. I voted this one up for the mention of compFUSEd, which sounds like a good fit for your problem. – unwind Apr 16 '09 at 08:21
  • zlibc is read-only, but it can definitely be used without recompiling, through the LD_PRELOAD mechanism. – EFraim Apr 16 '09 at 08:31
  • Dead link for compFUSEd and I couldn't find a replacement. – Ken Sharp Feb 25 '15 at 22:40
  • @KenSharp Perhaps, https://code.google.com/p/fusecompress/wiki/Usage ? Or something from the link given in the answer: http://stackoverflow.com/a/755497/94687 ? Or http://www.lessfs.com/wordpress/ described at http://www.phoronix.com/scan.php?page=news_item&px=MTA0MzQ ? – imz -- Ivan Zakharyaschev Jun 19 '15 at 10:44
2

btrfs:

https://btrfs.wiki.kernel.org/index.php/Main_Page

provides support for pretty fast "automatic transparent compression/decompression" these days, and is present (though marked experimental) in newer kernels.
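
If repartitioning is not an option, one way to try this is a btrfs volume that lives in an ordinary file and is loop-mounted with compression enabled. The following is only a sketch under several assumptions: it needs root and btrfs-progs, and the function name, image path, mount point and size are hypothetical placeholders, not anything from the question.

```shell
# Sketch only: requires root and btrfs-progs.
# make_compressed_volume is a hypothetical helper name.
make_compressed_volume() {
    image=$1       # backing file, e.g. /var/tmp/compressed.img
    mountpoint=$2  # where the volume appears, e.g. /mnt/compressed
    size=$3        # size of the sparse backing file, e.g. 20G

    truncate -s "$size" "$image"   # create a sparse backing file
    mkfs.btrfs "$image"            # format it as btrfs
    mkdir -p "$mountpoint"
    # compress=zlib asks btrfs to compress file data transparently
    mount -o loop,compress=zlib "$image" "$mountpoint"
}

# Example, run as root:
#   make_compressed_volume /var/tmp/compressed.img /mnt/compressed 20G
```

The program would then be pointed at files under the mount point and would read and write them as ordinary files, including seeking and reading the input twice.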

rogerdpack
2

Named pipes won't give you full-duplex operation, so it will be a little more complicated if you need to provide just one filename.

Do you know if your application needs to seek through the file?

Does your application work with stdin and stdout?

Maybe a solution is to create a mini compressed file system that contains only a directory with your files.

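Since you have separate input and output files, you can do the following:

mkfifo readfifo
mkfifo writefifo
# decompress the input into the FIFO the program reads from
zcat yourinputfile > readfifo &
# compress whatever the program writes to the output FIFO
gzip < writefifo > youroutputfile &

Then launch your program, giving it readfifo as its input file and writefifo as its output file.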
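
To check the FIFO plumbing before involving the real program, here is a minimal self-contained sketch. It assumes a single in-order pass over the input; the sample data, the file names, and the `tr` command standing in for the real program are all made up for illustration. `gzip -dc` is used as the equivalent of `zcat`.

```shell
#!/bin/sh
# Sketch: run a file-only "program" between two gzip processes via named pipes.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data standing in for the large ASCII input.
seq 1 100000 > plain_input.txt
gzip -c plain_input.txt > input.gz

mkfifo readfifo writefifo

# Decompress the input into the FIFO the program will read from.
gzip -dc input.gz > readfifo &
# Compress whatever the program writes to its output FIFO.
gzip -c < writefifo > output.gz &

# Stand-in for the real program: reads one file in order, writes another.
# With the real program this would be something like:
#   yourprogram readfifo writefifo
tr '0-9' 'a-j' < readfifo > writefifo

# Wait for both background gzip processes to finish and flush.
wait

# Compare the decompressed output with what a plain run would produce.
expected=$(tr '0-9' 'a-j' < plain_input.txt | cksum)
actual=$(gzip -dc output.gz | cksum)
if [ "$expected" = "$actual" ]; then echo "match"; else echo "mismatch"; fi
```

Each background process blocks in its open of the FIFO until the program opens the other end, so the order in which the three are started does not matter much.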

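Now, you will probably run into trouble with reading the input twice. A FIFO cannot be rewound: once zcat has finished and closed its end, your program sees end-of-file, and when it reopens readfifo for the second pass there is no writer left, so the open blocks until something feeds the FIFO again (for example, a second zcat of the same input). Note that SIGPIPE is not the issue here; it is only sent to a writer whose reader has gone away.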

The proper solution is probably to use a compressed file system like compFUSEd, because then you don't have to worry about unsupported operations like seek.

shodanex
  • I've edited my question to address your inquiries. The program does not read or write stdin/out. – A. Rex Apr 16 '09 at 08:11
1

FUSE options: http://apps.sourceforge.net/mediawiki/fuse/index.php?title=CompressedFileSystems

Jim T
0

Which language are you using?

If you are using Java, take a look at the GZIPInputStream and GZIPOutputStream classes in the API doc.

If you are using C/C++, zlibc is probably the best way to go about it.

trshiv
  • I cannot change the program, so this must work outside of the program. I'm cool with any language, but I thought this was more working with Linux than any programming. – A. Rex Apr 16 '09 at 08:15