
I found out that if you sort a list of files by file extension rather than alphabetically before putting them in a tar archive, you can dramatically increase the compression ratio (especially for large source trees where you likely have lots of .c, .o, and .h files).
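The effect is easy to reproduce synthetically (everything below is made up for illustration): create two sets of identical files whose copies end up more than 32 KiB apart in the archive, so gzip's window only sees the redundancy when same-extension files are adjacent:

```shell
# Two 20 KiB random blobs; ten identical copies of each, as fake .c and .h files.
dir=$(mktemp -d) && cd "$dir"
head -c 20480 /dev/urandom > payload_a
head -c 20480 /dev/urandom > payload_b
for i in 0 1 2 3 4 5 6 7 8 9; do
    cp payload_a "file$i.c"
    cp payload_b "file$i.h"
done
# Interleaved member order (.c .h .c .h ...) vs grouped by extension.
for i in 0 1 2 3 4 5 6 7 8 9; do
    printf 'file%s.c\nfile%s.h\n' "$i" "$i"
done > interleaved.txt
{ printf 'file%s.c\n' 0 1 2 3 4 5 6 7 8 9
  printf 'file%s.h\n' 0 1 2 3 4 5 6 7 8 9; } > grouped.txt
tar -cf interleaved.tar -T interleaved.txt
tar -cf grouped.tar -T grouped.txt
gzip -c interleaved.tar > interleaved.tar.gz
gzip -c grouped.tar > grouped.tar.gz
int_size=$(wc -c < interleaved.tar.gz)
grp_size=$(wc -c < grouped.tar.gz)
echo "interleaved: $int_size bytes, grouped: $grp_size bytes"
```

Each tar member is about 20.5 KiB (data plus header), so identical same-extension files fall inside gzip's 32 KiB window only in the grouped archive, which comes out dramatically smaller.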

I couldn't find an easy way to sort files in the shell that behaves the way I'd expect in every case. A simple approach like `find | rev | sort | rev` does the job, but the files come out in an odd order that isn't arranged as well for compression. Other tools such as `ls -X` don't work with the output of find, and `sort -t. -k 2,2 -k 1,1` breaks when a filename contains more than one period (e.g. version-1.5.tar). Another quick-and-dirty option uses sed to replace the last period with a / (which cannot occur within a filename), then sorts, splitting on the /:

sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1  |  sed 's/\/\([^/]*\)$/\1/'

However, once again this doesn't work on the output of find, whose paths contain /s, and every other character (except the NUL byte) is allowed in *nix filenames.
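For instance, the multiple-period problem with `sort -t.` is easy to demonstrate (assuming GNU sort; the file names are made up, and LC_ALL=C is set here only to make the ordering reproducible):

```shell
# Splitting on '.' keys version-1.5.tar on its second field ("5"), not
# its real extension, so the tarball sorts ahead of the .c files:
printf '%s\n' main.c version-1.5.tar util.c | LC_ALL=C sort -t. -k2,2 -k1,1
```

The tarball is printed first, before the .c files, because "5" compares before "c".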

I discovered that in Perl you can write a custom comparison subroutine that returns the same values as cmp (similar to strcmp in C) and pass it to Perl's sort function; the comparison itself was easy to write with Perl regular expressions. That's exactly what I did: I now have a Perl script which calls

@lines = <STDIN>;
print sort myComparisonFunction @lines;

However, Perl is not as portable as bash, so I want to be able to do this with a shell script. In addition, find does not put a trailing / on directory names, so the script treats directories like files without an extension. Ideally, I'd like tar to read all the directories first, then the regular files (sorted), then symbolic links, which I can achieve via

cat <(find -type d) <(find -type f | perl exsort.pl) <(find -not -type d -and -not -type f) | tar --no-recursion -T - -cvf myfile.tar

but I still run into the issue that either I have to type this monstrosity every time, or I need both a shell script for this long line AND a Perl script for the sorting; and since Perl isn't available everywhere, stuffing everything into one Perl script isn't a great solution either. (I'm mainly focused on older computers, because nowadays all modern Linux and OS X systems ship with a recent enough version of Perl.)

I'd like to put everything together into one shell script, but I don't know how to pass a custom function to the GNU sort tool. Am I out of luck, and do I have to use a Perl script? Or can I do this with a single shell script?

EDIT: Thanks for the idea of a Schwartzian Transform. I used a slightly different method, based on sed. My final sorting routine is as follows:

sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' | sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' | sort -t/ -k1,1 -k2,2 -k3,3 | sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'

This handles special characters (such as *) in filenames and places files without an extension first, because they are often text files (Makefile, COPYING, README, configure, etc.).
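Everything can then live in one shell script. The sketch below is my packaging of the pieces from the question (the script name `tarsort.sh`, the argument handling, and the LC_ALL=C setting are my assumptions; the sed pipeline and the directories/files/others ordering are from the question), assuming GNU sed, sort, find, and tar:

```shell
#!/bin/sh
# Hypothetical one-file wrapper.  Usage: tarsort.sh archive.tar
# (archives the current directory, regular files sorted by extension)
LC_ALL=C; export LC_ALL   # byte-wise collation so sort is predictable

# Schwartzian transform in sed: rewrite each path as ext/name/dir,
# sort on those keys, then undo the rewrite.  Extensionless names get
# an empty first key, so they sort first.
extsort() {
    sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' |
        sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' |
        sort -t/ -k1,1 -k2,2 -k3,3 |
        sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'
}

# Directories first, then regular files sorted by extension, then
# everything else (symlinks, devices, ...), fed to tar as a list.
if [ $# -ge 1 ]; then
    { find . -type d
      find . -type f | extsort
      find . ! -type d ! -type f
    } | tar --no-recursion -T - -cf "$1"
fi
```

Invoked as `./tarsort.sh myfile.tar`, this reproduces the `cat <(find ...)` pipeline from the question without requiring Perl.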

P.S. In case anyone wants my original comparison function or thinks I could improve on it, here it is:

sub comparison {
    # Note: the /r substitution modifier requires Perl 5.14+.
    my $first = $a;
    my $second = $b;
    # Split each path into directory part, basename, and extension.
    my $fdir = $first =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
    my $sdir = $second =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
    my $fname = $first =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
    my $sname = $second =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
    my $fbase = $fname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
    my $sbase = $sname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
    my $fext = $fname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
    my $sext = $sname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
    # A name with no period leaves the base empty (the whole name lands
    # in the "extension"); such files sort first.
    if ($fbase eq "" && $sbase ne ""){
        return -1;
    }
    if ($sbase eq "" && $fbase ne ""){
        return 1;
    }
    # Otherwise compare by extension, then basename, then directory.
    (($fext cmp $sext) or ($fbase cmp $sbase)) or ($fdir cmp $sdir)
}
Leo Izen
  • "Perl is not as portable as bash..." What??? – ThisSuitIsBlackNot Dec 31 '13 at 17:43
  • @ThisSuitIsBlackNot on older computers. For example, older versions of Mac OS don't come with it by default. – Leo Izen Dec 31 '13 at 17:44
  • That is not the same as portable. Windows doesn't come with bash either. – ThisSuitIsBlackNot Dec 31 '13 at 17:45
  • +1 for overall question AND the info about tar efficiency! Could you extract the extension, append it as temporary first field, and sort that with `-k1 -k2` (k2 being the full pathname), and remove the tmp field from the final list that is passed to tar? – shellter Dec 31 '13 at 17:49
  • find can include the /: `find PATHS -type d -printf '%p/\n' -o -print` (also possible in non-GNU find with -exec printf, but harder) – ysth Dec 31 '13 at 18:08

3 Answers


If you're familiar with Perl, you can use a Schwartzian Transform in Bash too.

A Schwartzian Transform simply means prepending the desired sort key to each item, doing the sort, then stripping the key back off. It is named after Randal Schwartz and is used heavily in Perl, but the technique works in other languages too:

You want to sort your files by extension:

find . -type f 2> /dev/null | while IFS= read -r file   # Assuming no newlines in file names
do
    suffix=${file##*.}
    printf "%-10.10s %s\n" "$suffix" "$file"
done | sort | awk '{print substr( $0, 12 ) }' > files_to_tar.txt

I read each file name from my find. I use printf to prepend each file name with the suffix I want to sort by. Then, I do my sort. My awk strips the sort key back off (with the 10-character suffix field plus a space, the name starts at column 12), leaving just the file names, still sorted by suffix.

Now, your files_to_tar.txt file contains the names of your files sorted by suffix. You can use the -T parameter of tar to read the names of the files from this file:

$ tar -czvf backup.tar.gz -T files_to_tar.txt
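The decorate/sort/strip steps can be tried on a fixed list without touching the filesystem (the names below are invented; LC_ALL=C is added only so the result doesn't depend on locale):

```shell
# Decorate each name with its suffix, sort, then strip the 11-character
# key prefix (10-wide suffix field plus one space) back off.
printf '%s\n' src/util.c README src/util.h lib/main.c |
while IFS= read -r file
do
    suffix=${file##*.}                       # extension, or whole name if none
    printf "%-10.10s %s\n" "$suffix" "$file"
done | LC_ALL=C sort | awk '{ print substr($0, 12) }'
```

README comes out first (its "suffix" is the whole name, and uppercase sorts before lowercase in the C locale), followed by the .c files and then the .h file.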
David W.
  • I ended up doing it with sed so I didn't have to worry about parentheses in filenames. My sort function is this: `sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' | sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' | sort -t/ -k1,1 -k2,2 -k3,3 | sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'` – Leo Izen Jan 01 '14 at 21:20
  • +1, you might want to handle files without extensions as well – Zaid Jan 02 '14 at 15:12
  • @Zaid the sed routine I used above puts files without extensions first, as most of them are text files (Makefile, COPYING, configure, etc.) – Leo Izen Jan 02 '14 at 19:29

You could pipe the result of find to `ls -X` via xargs (see the man page), which should sort them by extension:

cat <(find -type d) <(find -type f | xargs ls -X ) <(find -not -type d -and -not -type f) | tar --no-recursion -T - -cvf myfile.tar
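A throwaway directory shows what GNU `ls -X` does (the file names are invented; `-1` prints one entry per line):

```shell
# GNU ls -X sorts by extension, with extensionless names first and a
# fall-back to name order within each extension.
dir=$(mktemp -d)
touch "$dir/main.c" "$dir/util.h" "$dir/Makefile" "$dir/util.c"
ls -1X "$dir"
rm -rf "$dir"
```

Here Makefile is listed first, then the .c files, then the .h file.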
dpp
  • +1 for an elegant solution, but it's not perfect. For some reason, even when using `ls -1Xd`, `ls` largely groups files by extension but doesn't fully group them, and it doesn't sort by extension and then by name within each extension (this is an issue when you've got multiple versions of a source tree lying around, so you might have multiple copies of thisfilename.c and multiple thatfilename.c, which are mostly identical). – Leo Izen Jan 02 '14 at 03:10

To sort by extension to group similar files, and then by md5sum to group identical files:

find "$your_dir" | xargs md5sum | sed 's/  /\x00/; s/\.[^.]*$/&\x00&/' | sort -t'\0' -k3,3 | cut -d '' -f2

Note sort -k3,3 is the extension sort, and the default "last resort" comparison on the whole line then groups files with the same md5sum together.

Also consider xz instead of gz if you are worried about space.

pixelbeat
  • "Assuming no tabs or spaces in files" sorry, but fat chance of that being a safe assumption "Also consider xz instead of gz if you are worried about space" Of course. But the difference between a sorted .tar.xz and an unsorted is dramatic. – Leo Izen Jan 22 '15 at 02:21
  • Well it was for illustration. I've amended to use NUL delimiter for robust operation. – pixelbeat Jan 22 '15 at 10:33
  • Also, I have absolutely no idea why you're using an MD5 checksum. Checksums are designed to be as discontinuous as possible. That makes absolutely no sense when sorting. (The idea isn't to collect identical files together. It's to collect files with similar contents, sorted by file extension. #include shows up a lot in .c files so those should be grouped together for maximum compression ratio. They don't have to be identical.) – Leo Izen Jan 24 '15 at 07:00
  • The md5sum sort is only done after the extension sort. I.E. the extension sort is used to implicitly get similar files together, and the md5sum is just an optional extra to group identical files together. If you wanted something in between you could use a similarity hash instead http://stackoverflow.com/questions/4323977/string-similarity-score-hash but that's overkill IMHO as you'd be essentially duplicating logic within the compression program itself. It might be useful though for less sophisticated (de)compressors – pixelbeat Jan 24 '15 at 21:48
  • If you're using XZ then a 64 MiB dictionary is very much enough to see that redundancy. This might be necessary with gzip's 32k dictionary, but XZ has a large enough dictionary to make this unnecessary. – Leo Izen Feb 08 '15 at 13:29