
I am looking to compress a lot of data spread across many sub-directories into an archive. I cannot simply use the system's built-in tar because I need my Perl script to work on Windows as well as Linux. I have found the Archive::Tar module, but its documentation gives a warning:

Note that this method [create_archive()] does not write on the fly as it were; it still reads all the files into memory before writing out the archive. Consult the FAQ below if this is a problem.

Because of the sheer size of my data, I want to write 'on the fly', but I cannot find anything useful in the FAQ about writing files. It suggests using the iterator iter():

Returns an iterator function that reads the tar file without loading it all in memory. Each time the function is called it will return the next file in the tarball.

my $next = Archive::Tar->iter( "example.tar.gz", 1, {filter => qr/\.pm$/} );
while( my $f = $next->() ) {
    print $f->name, "\n";
    $f->extract or warn "Extraction failed";
    # ....
}

But this only covers reading files, not writing a compressed archive. So my question is: how can I take a directory $dir and recursively add it to an archive archive.tar.bz2 with bzip2 compression in a memory-friendly manner, i.e. without first loading the whole tree into memory?

Following the suggestions in the comments, I tried to build my own script using Archive::Tar::Streamed and IO::Compress::Bzip2, but to no avail.

use strict;
use warnings;

use Archive::Tar::Streamed;
use File::Spec;    # catfile() is called as a class method below; File::Spec exports nothing
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

my ($in_d, $out_tar, $out_bz2) = @ARGV;

open(my $out_fh, '>', $out_tar) or die "Couldn't create archive: $!";
binmode $out_fh;

my $tar = Archive::Tar::Streamed->new($out_fh);

opendir(my $in_dh, $in_d) or die "Could not opendir '$in_d': $!";
while (my $in_f = readdir $in_dh) {
  next unless ($in_f =~ /\.xml$/);
  print STDOUT "Processing $in_f\r";
  $in_f = File::Spec->catfile($in_d, $in_f);
  $tar->add($in_f);
}

print STDOUT "\nBzip'ing $out_tar\r";

bzip2 $out_tar => $out_bz2
    or die "Bzip2 failed: $Bzip2Error\n";

Very quickly, my system runs out of memory. I have 32GB available in my current system, but it gets flooded almost immediately. Some files in the directory I am trying to add to the archive exceed 32GB.

(Screenshot: "Memory exceeded")

So I wonder whether, even with the Streamed class, each file has to be read into memory completely before being added to the archive? I assumed the files themselves would be streamed to the archive in buffers, but perhaps it is simply that, instead of holding ALL the files in memory at once, Streamed only needs one complete file in memory at a time, adding them to the archive one by one?

Bram Vanroy
  • Related: https://stackoverflow.com/questions/653127/how-can-i-tar-files-larger-than-physical-memory-using-perls-archivetar?rq=1 – melpomene Jul 30 '17 at 13:45
  • What does "across platforms" refer to? Do you need to pull these files from multiple systems? – Borodin Jul 30 '17 at 14:06
  • @Borodin I meant that the script needs to work in Windows as well as Linux. I edited the first paragraph to reflect this. – Bram Vanroy Jul 30 '17 at 14:16
  • Can't you just install a `tar` program on Windows? Might be easier in the long run. – melpomene Jul 30 '17 at 14:20
  • @melpomene I could. But how would I then write a script that is generic enough that I don't need to change anything for it to work under Linux (built-in `tar`) and Windows (not built-in)? (The `tar`ing is not standalone, and is part of a larger Perl script.) – Bram Vanroy Jul 30 '17 at 14:26
  • 1
    @BramVanroy: Okay, thank you. Did you look at [`Archive::Tar::Streamed`](https://metacpan.org/pod/Archive::Tar::Streamed) as described in [the question that **melpomene** linked to](https://stackoverflow.com/questions/653127)? Contrary to the accepted answer, it ***doesn't*** require a `tar` command line utility, and so should be fine on your Windows systems. The documentation says *"It also aims to be portable, and available on platforms without a native tar"*. – Borodin Jul 30 '17 at 14:26
  • @Borodin Thanks for looking into it. I looked at it, but as far as I can tell it is not possible to pass a compression method as an argument to the Streamed class. Would this mean I have to 'bzip' the created `tar` file by using Archive::Tar anyway? But wouldn't that mean that the whole `tar` (possibly hundreds of gigabytes) needs to be read into memory? – Bram Vanroy Jul 30 '17 at 14:38
  • You could probably use [IO::Compress::Bzip2](https://metacpan.org/pod/IO::Compress::Bzip2) for bzipping. – melpomene Jul 30 '17 at 14:43
  • @Borodin and melpomene, please see my edit. – Bram Vanroy Jul 30 '17 at 15:02
  • *"`Streamed` allows to only need one file in memory completely, and then adding that to the archive, one by one?"* The main difference from `Archive::Tar` is that the tar file is built incrementally on disk instead of in memory. Adding a file or a list of files to the archive will require all of those files' data in memory whichever module is used. This can be minimised by adding only one file at a time. Does your data include any multi-gigabyte individual files? I've written a short solution and will post it tomorrow unless you have files that don't fit in memory. – Borodin Jul 30 '17 at 20:14
  • @Borodin Unfortunately, as I've written in my post *I have 32GB available in my current system, but it gets flooded almost immediately. Some files in the directory I am trying to add to the archive exceed 32GB.* So yes, some files are larger than my available memory. Nonetheless, please do post your solution because I will need something similar for a directory that consists of loads of sub-directories which contain all many *small* files. By the way, just to be clear: just because it is not possible in Perl, does not mean I cannot do this in simple command line (backtick operator), right? – Bram Vanroy Jul 30 '17 at 20:19
  • 1
    @BramVanroy: I will look at writing a variant of `Archive::Tar` tomorrow which streams the individual files to the output as well as the whole archive. It shouldn't be hard. Meanwhile, yes you can use `system` or backticks. Windows doesn't come with a tar or a bzip2 command-line archiver, but [*GnuWin* provides both](http://gnuwin32.sourceforge.net/packages.html). – Borodin Jul 30 '17 at 20:26
  • @Borodin Sounds interesting. Perhaps the quote in Sinan Ünür's answer is of some help. If you plan to create your own Archive module and throw it on CPAN, that'd be cool! I can imagine that many people who are working with big data and Perl (and also use a Windows machine to test some stuff) find this useful as well! – Bram Vanroy Jul 30 '17 at 21:00
  • @BramVanroy: It may end up on CPAN, but it will take a lot of work to make it worthy for that. I intend to write something that will work just for your situation at present. Supporting things like symbolic links can wait until later. – Borodin Jul 30 '17 at 22:59
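
For reference, a minimal sketch of the external-tar route discussed in the comments (hedged: it assumes a GNU `tar` with bzip2 support on the PATH, e.g. from GnuWin on Windows, and the script and archive names are illustrative):

use strict;
use warnings;

my ($dir, $archive) = @ARGV;    # e.g. perl tar_dir.pl data archive.tar.bz2

# GNU tar writes the archive in fixed-size blocks as it walks the tree,
# so memory use stays small even when individual files exceed RAM.
# -c create, -j filter through bzip2, -f write to the named file
system('tar', '-cjf', $archive, $dir) == 0
    or die "tar failed: exit status $?\n";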

2 Answers


Unfortunately, what you want is not possible in Perl:

I agree, it would be nice if this module could write the files in chunks and then rewrite the headers afterwards (to maintain the relationship of Archive::Tar doing the writing). You could maybe walk the archive backwards knowing you split the file into N entries, remove the extra headers, and update the first header with the sum of their sizes.

At the moment the only options are: use Archive::Tar::File, split the data into manageable sizes outside of perl, or use the tar command directly (to use it from perl, there's a nice wrapper on CPAN: Archive::Tar::Wrapper).

I don't think we'll ever have a truly non-memory-resident tar implementation in Perl based on Archive::Tar. To be honest, Archive::Tar itself needs to be rewritten or succeeded by something else.
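
To illustrate the Archive::Tar::Wrapper route, here is a hedged sketch (assumptions: an external tar binary is installed; `add()` takes a logical path inside the archive followed by the real path on disk, and stages a copy in a temporary directory rather than in memory; `write()`'s true second argument produces a gzip-compressed archive, so bzip2 output would need the underlying tar's own options):

use strict;
use warnings;

use Archive::Tar::Wrapper;
use File::Find 'find';

my ($dir, $tarfile) = @ARGV;

my $arch = Archive::Tar::Wrapper->new();

# Each add() copies the file into a scratch workspace on disk,
# keeping memory use low (at the cost of temporary disk space)
find(sub {
    return unless -f;
    $arch->add($File::Find::name, $File::Find::name);
}, $dir);

# A true second argument requests a compressed (gzip) archive
$arch->write($tarfile, 1);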

Sinan Ünür

This is the original version of my solution, which still stores a whole file in memory. I probably shan't have time today to add an update which only stores partial files, as the Archive::Tar module doesn't have the friendliest API.

use strict;
use warnings 'all';
use autodie; # Remove need for checks on IO calls

use File::Find 'find';
use File::Spec ();                # for File::Spec->canonpath below
use Archive::Tar::Streamed ();
use Compress::Raw::Bzip2;
use Time::HiRes qw/ gettimeofday tv_interval /;

# Set a default root directory for testing
#
BEGIN {
    our @ARGV;
    @ARGV = 'E:\test' unless @ARGV;
}

use constant ROOT_DIR => shift;

use constant KB => 1024;
use constant MB => KB * KB;
use constant GB => MB * KB;

STDOUT->autoflush; # Make sure console output isn't buffered

my $t0 = [ gettimeofday ];

# Create a pipe, and fork a child that will build a tar archive
# from the files and pass the result to the pipe as it is built
#
# The parent reads from the pipe and passes each chunk to the
# module for compression. The result of zipping each block is
# written directly to the bzip2 file
#
pipe( my $pipe_from_tar, my $pipe_to_parent );  # Make our pipe
my $pid = fork;                       # fork the process
defined $pid or die "fork failed: $!";

if ( $pid == 0 ) {    # child builds tar and writes it to the pipe

    $pipe_from_tar->close;    # Close the parent side of the pipe
    $pipe_to_parent->binmode;
    $pipe_to_parent->autoflush; 

    # Create the ATS object, specifying that the tarred output
    # will be passed straight to the pipe
    #
    my $tar = Archive::Tar::Streamed->new( $pipe_to_parent );

    find(sub {

        my $file = File::Spec->canonpath( $File::Find::name );
        $tar->add( $file );

        print "Processing $file\n" if -d;

    }, ROOT_DIR );

    $tar->writeeof; # This is undocumented but essential

    $pipe_to_parent->close;
    exit 0;    # the child's work is done
}
else {    # parent reads the tarred data, bzips it, and writes it to the file

    $pipe_to_parent->close; # Close the child side of the pipe
    $pipe_from_tar->binmode;

    open my $bz2_fh, '>:raw', 'T:\test.tar.bz2';
    $bz2_fh->autoflush;

    # The first parameter *must* have a value of zero. The default
    # is to accumulate each zipped chunk into the output variable,
    # whereas we want to write each chunk to a file
    #
    my ( $bz, $status ) = Compress::Raw::Bzip2->new( 0 );
    defined $bz or die "Cannot create bzip2 object: $status\n";

    my $zipped;

    while ( my $len = read $pipe_from_tar, my $buff, 8 * MB ) {

        $status = $bz->bzdeflate( $buff, $zipped );
        $bz2_fh->print( $zipped ) if length $zipped;
    }

    $pipe_from_tar->close;

    $status = $bz->bzclose( $zipped );
    $bz2_fh->print( $zipped ) if length $zipped;

    $bz2_fh->close;

    my $elapsed = tv_interval( $t0 );

    printf "\nProcessing took %s\n", hms($elapsed);
}


use constant MINUTE => 60;
use constant HOUR   => MINUTE * 60;

sub hms {
    my ($s) = @_;

    my @ret;

    if ( $s > HOUR ) {
        my $h = int($s / HOUR);
        $s -= $h * HOUR;
        push @ret, "${h}h";
    }

    if ( $s > MINUTE or @ret ) {
        my $m = int($s / MINUTE);
        $s -= $m * MINUTE;
        push @ret, "${m}m";
    }

    push @ret, sprintf "%.1fs", $s;

    "@ret";
}
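
The point of the pipe-and-fork design is that the compression side is fully streaming: the parent holds at most one 8 MB chunk of tar data plus its compressed form at any time, regardless of the total archive size. The remaining limit is in the child, where Archive::Tar::Streamed still reads each individual file completely into memory to build its entry, which is why this version cannot yet handle single files larger than RAM.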
Borodin