23

I have a 1TB sparse file on Linux which actually stores only 32MB of data.

Is it possible to "efficiently" create a package that stores the sparse file? The package should unpack into a 1TB sparse file on another computer. Ideally, the "package" should be around 32MB.

Note: One possible solution is to use 'tar': https://wiki.archlinux.org/index.php/Sparse_file#Archiving_with_.60tar.27

However, for a 1TB sparse file, although the tarball may be small, archiving it takes far too long.
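For reference, that wiki recipe boils down to tar's --sparse (-S) flag, roughly like this (the file and target names here are just examples):

$ tar cSf sparse-1.tar sparse-1      # -S (--sparse) handles the holes while archiving
$ tar xSf sparse-1.tar -C /target    # recreate the sparse file somewhere else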

Edit 1

I tested tar and gzip, and the results are as follows. (Note that this particular sparse file contains no data at all: 0 bytes of data.)

$ du -hs sparse-1
0   sparse-1

$ ls -lha sparse-1
-rw-rw-r-- 1 user1 user1 1.0T 2012-11-03 11:17 sparse-1

$ time tar cSf sparse-1.tar sparse-1

real    96m19.847s
user    22m3.314s
sys     52m32.272s

$ time gzip sparse-1

real    200m18.714s
user    164m33.835s
sys     10m39.971s

$ ls -lha sparse-1*
-rw-rw-r-- 1 user1 user1 1018M 2012-11-03 11:17 sparse-1.gz
-rw-rw-r-- 1 user1 user1   10K 2012-11-06 23:13 sparse-1.tar

The 1TB file sparse-1, which contains no data at all, can be archived by 'tar' into a 10KB tarball, or compressed by gzip into a ~1GB file. gzip takes roughly twice as long as tar.

From this comparison, 'tar' seems better than gzip.

However, 96 minutes is still far too long for a sparse file that contains no data.

Edit 2

rsync seems to finish copying the file in more time than tar but less than gzip:

$ time rsync --sparse sparse-1 sparse-1-copy

real    124m46.321s
user    107m15.084s
sys     83m8.323s

$ du -hs sparse-1-copy 
4.0K    sparse-1-copy

Hence, tar followed by cp or scp should be faster than rsync alone for this extremely sparse file.
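If that holds, the whole transfer could look roughly like this (the remote host and paths are only placeholders):

$ tar cSf sparse-1.tar sparse-1                 # ~10KB tarball thanks to -S
$ scp sparse-1.tar user@remote:/tmp/            # transfer only the small tarball
$ ssh user@remote 'tar xSf /tmp/sparse-1.tar'   # recreate the 1TB sparse file remotely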

Edit 3

Thanks to @mvp for pointing out the SEEK_HOLE functionality in newer kernels. (I was previously working on a 2.6.32 Linux kernel.)

Note: bsdtar version >=3.0.4 is required (check here: http://ask.fclose.com/4/how-to-efficiently-archive-a-very-large-sparse-file?show=299#c299 ).

On a newer kernel with Fedora 17, tar and cp handle the sparse file very efficiently.

[zma@office tmp]$ ls -lh pmem-1 

-rw-rw-r-- 1 zma zma 1.0T Nov  7 20:14 pmem-1
[zma@office tmp]$ time tar cSf pmem-1.tar pmem-1

real    0m0.003s
user    0m0.003s
sys 0m0.000s
[zma@office tmp]$ time cp pmem-1 pmem-1-copy

real    0m0.020s
user    0m0.000s
sys 0m0.003s
[zma@office tmp]$ ls -lh pmem*
-rw-rw-r-- 1 zma zma 1.0T Nov  7 20:14 pmem-1
-rw-rw-r-- 1 zma zma 1.0T Nov  7 20:15 pmem-1-copy
-rw-rw-r-- 1 zma zma  10K Nov  7 20:15 pmem-1.tar
[zma@office tmp]$ mkdir t
[zma@office tmp]$ cd t
[zma@office t]$ time tar xSf ../pmem-1.tar 

real    0m0.003s
user    0m0.000s
sys 0m0.002s
[zma@office t]$ ls -lha
total 8.0K
drwxrwxr-x   2 zma  zma  4.0K Nov  7 20:16 .
drwxrwxrwt. 35 root root 4.0K Nov  7 20:16 ..
-rw-rw-r--   1 zma  zma  1.0T Nov  7 20:14 pmem-1

I am using a 3.6.5 kernel:

[zma@office t]$ uname -a
Linux office.zhiqiangma.com 3.6.5-1.fc17.x86_64 #1 SMP Wed Oct 31 19:37:18 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Deduplicator
ericzma
  • `gzip` or `bzip2` should do a beautiful job compressing it. `pigz` and `pbzip2` are their respective modern equivalents that utilize all the cores. You'll be pleasantly surprised how quickly they run. – Marcin Nov 06 '12 at 14:13
  • @Marcin compression by gzip seems worse than tar. Please find the updated question with the results of gzip and tar. – ericzma Nov 07 '12 at 08:34
  • When you say "a sparse file of 0 byte" do you mean every byte is 0? That's a different question. – Matthew Strawbridge Nov 07 '12 at 08:43
  • @MatthewStrawbridge I meant that the sparse file contains data of 0 byte (no data). – ericzma Nov 07 '12 at 08:46
  • Wow, that's really sparse ;-) In that case you can "compress" it to a single value: the number of bytes in the file! – Matthew Strawbridge Nov 07 '12 at 08:52
  • gzip performs ridiculously poorly for data that has long strings of repeated characters. LZMA is not much better. long stretches of 0's 1's or anything else get spectacularly compressed by bzip. I had a 1.8GB file with mostly (90%) zeros and the rest random integers. it got compressed to around 800kB. the speed sucks though. – staticd Oct 07 '13 at 07:27

5 Answers

32

Short answer: Use bsdtar or GNU tar (version 1.29 or later) to create archives, and GNU tar (version 1.26 or later) to extract them on another box.

Long answer: There are some requirements for this to work.

First, the Linux kernel must be at least version 3.1 (Ubuntu 12.04 or later will do), so that it supports the SEEK_HOLE functionality.

Then, you need a tar utility that can use this syscall. GNU tar supports it since version 1.29 (released on 2016/05/16; it should be present by default since Ubuntu 18.04), and bsdtar since version 3.0.4 (available since Ubuntu 12.04) - install it using sudo apt-get install bsdtar.

While bsdtar (which uses libarchive) is awesome, unfortunately it is not very smart when it comes to untarring - it stupidly requires at least as much free space on the target drive as the untarred file's apparent size, without regard to holes. GNU tar will untar such sparse archives efficiently and will not check this condition.

This is log from Ubuntu 12.10 (Linux kernel 3.5):

$ dd if=/dev/zero of=1tb seek=1T bs=1 count=1
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000143113 s, 7.0 kB/s

$ time bsdtar cvfz sparse.tar.gz 1tb 
a 1tb

real    0m0.362s
user    0m0.336s
sys 0m0.020s

# Or, use gnu tar if version is later than 1.29:
$ time tar cSvfz sparse-gnutar.tar.gz 1tb
1tb

real    0m0.005s
user    0m0.006s
sys 0m0.000s

$ ls -l
-rw-rw-r-- 1 autouser autouser 1099511627777 Nov  7 01:43 1tb
-rw-rw-r-- 1 autouser autouser           257 Nov  7 01:43 sparse.tar.gz
-rw-rw-r-- 1 autouser autouser           134 Nov  7 01:43 sparse-gnutar.tar.gz
$

Like I said above, unfortunately, untarring with bsdtar will not work unless you have 1TB of free space. However, any version of GNU tar works just fine to untar such a sparse archive:

$ rm 1tb 
$ time tar -xvSf sparse.tar.gz 
1tb

real    0m0.031s
user    0m0.016s
sys 0m0.016s
$ ls -l
total 8
-rw-rw-r-- 1 autouser autouser 1099511627777 Nov  7 01:43 1tb
-rw-rw-r-- 1 autouser autouser           257 Nov  7 01:43 sparse.tar.gz
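Not part of the original log, but comparing apparent size and actual disk usage is a quick way to confirm the extracted file came back sparse:

$ ls -lh 1tb    # apparent size: ~1.0T
$ du -h 1tb     # actual disk usage: only a few KB if the holes survived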
mvp
  • Awesome! I guess the SEEK_HOLE plays the trick! I tried the `tar` and `cp` on a 3.6.5 Linux kernel and both are very fast. Thanks! – ericzma Nov 07 '12 at 12:22
  • Is the requirement of Linux kernel 3.1 or later the case even if using a later version of libarchive? It looks like there's code which makes use of FIEMAP ioctl in versions 3.x of libarchive. https://github.com/libarchive/libarchive/blob/master/libarchive/archive_read_disk_entry_from_file.c#L1011 – bockmabe Oct 21 '13 at 21:54
  • Sadly, 1.5 years since I wrote this, GNU `tar` *still* has not learned to parse holes effectively, so this recipe is still very much relevant! :(... – mvp May 22 '14 at 08:44
  • I tried this with a 1MB empty sparse file and found that bsdtar handled this like a non-sparse file. For a 2tb sparse file with something in between, it worked as described above. Maybe it only works for very large files? – Alfe Sep 08 '17 at 12:05
  • Finally, GNU tar supports this properly since version 1.29 ;-) – mvp Sep 19 '19 at 00:59
8

I realize this question is very old, but here's an update that may be helpful to others who find their way here the same way I did.

Thankfully, mvp's excellent answer is now obsolete. According to the GNU tar release notes, SEEK_HOLE/SEEK_DATA was added in v. 1.29, released 2016-05-16. (And with GNU tar v. 1.30 being standard in Debian stable now, it's safe to assume that tar version ≥ 1.29 is available almost everywhere.)

So the way to handle sparse files now is to archive them with whichever tar (GNU or BSD) is installed on your system, and same for extracting.

Additionally, for sparse files that actually contain some data, if it's worthwhile to use compression (i.e. the data is compressible enough to save substantial disk space, and the disk space savings are worth the likely substantial time and CPU resources required to compress it):

  • tar -cSjf <archive>.tar.bz2 /path/to/sparse/file will both take advantage of tar's SEEK_HOLE functionality to quickly & efficiently archive the sparse file, and use bzip2 to compress the actual data.
  • tar --use-compress-program=pbzip2 -cSf <archive>.tar.bz2 /path/to/sparse/file, as alluded to in marcin's comment, will do the same while also using multiple cores for the compression task.

On my little home server with a quad-core Atom CPU, using pbzip2 vs bzip2 reduced the time by around 25 or 30%.

With or without compression, this will give you an archive that doesn't need any special sparse-file handling, takes up approximately the 'real' size of the original sparse file (or less if compressed), and can be moved around without worrying about inconsistency between different utilities' sparse file capabilities. For example: cp will automatically detect sparse files and do the right thing, rsync will handle sparse files properly if you use the -S flag, and scp has no option for sparse files (it will consume bandwidth copying zeros for all the holes and the resulting copy will be a non-sparse file whose size is the 'apparent' size of the original); but all of them will of course handle a tar archive just fine—whether it contains sparse files or not—without any special flags.
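To make those cases concrete, here is a rough sketch (the host, mount point and file names are only placeholders):

$ cp big.img /mnt/backup/               # cp detects the holes and preserves them
$ rsync -S big.img user@remote:/data/   # -S (--sparse) makes rsync recreate the holes
$ scp big.img user@remote:/data/        # no sparse support: every zero is sent and stored
$ scp big.tar.bz2 user@remote:/data/    # the archive itself needs no special handling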

Additional Notes

  1. When extracting, tar will automatically detect an archive created with -S so there's no need to specify it.
  2. An archive created with pbzip2 is stored in chunks. This results in the archive being marginally bigger than if bzip2 is used, but also means that the extraction can be multithreaded, unlike an archive created with bzip2.
  3. pbzip2 and bzip2 will reliably extract each other's archives without error or corruption.
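Putting notes 1-3 together, extraction could look like this (the archive name is just an example); tar figures out the sparse members on its own:

$ tar --use-compress-program=pbzip2 -xf archive.tar.bz2   # multi-threaded decompression
$ tar -xjf archive.tar.bz2                                # plain bzip2 works on the same archive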
Askeli
  • Thanks for notifying regarding tar 1.29 - this is great news! Btw, modern `cp` utility is automatically taking advantage of this and copying sparse files efficiently. – mvp Sep 19 '19 at 01:01
  • Good point, @mvp I have edited my answer to clarify that part. – Askeli Sep 20 '19 at 11:34
  • Wonderful answer! Especially the last paragraph regarding cp and scp and alluding to how to efficiently move large images from remote locations, etc. Great points, and adjusting backup scripts now! – oemb1905 Nov 03 '21 at 16:47
3

From a related question, maybe rsync will work:

rsync --sparse sparse-1 sparse-1-copy
wallyk
  • I tried this and after several minutes I killed it since it seems very busy there (two rsync processes that took ~89% and ~62% CPU). I do not expect that rsync works better than tar for this purpose. But I am giving it another try since the server is idle currently. – ericzma Nov 07 '12 at 09:00
  • `rsync` seems to finish copying the file in more time than `tar` but less than `gzip`. The results are in **Edit 2** of the question. – ericzma Nov 07 '12 at 11:58
  • `rsync` is not a fast or efficient program for copying files disk-to-disk, but it does have a lot of options you may not find elsewhere. You can use `rsync -S ...` to copy sparse files over a LAN, e.g over `ssh`. For copy disk-to-disk, just use `cp --sparse= – James Stevens Nov 12 '20 at 14:58
  • @James: Thanks! It is amazing how these utilities have evolved. – wallyk Nov 12 '20 at 18:08
2

Both the xz (since version 5.0.0) and zstd (since version 0.7.0) compression tools support sparse files.

For a quick test I created a 10GiB sparse file with 5MiB of actual (random) data at the very end.

% dd if=/dev/random of=file.img bs=5M count=1 seek=2047
1+0 records in
1+0 records out
5242880 bytes (5,2 MB, 5,0 MiB) copied, 0,0223623 s, 234 MB/s
% du -h --apparent-size file.img
10G file.img
% du -h file.img
5,0M    file.img
% sha1sum file.img
eb8104d1c1f8ac9dd502f7010f1625b283a8e423  file.img

xz was able to compress it to a non-sparse 6.5MiB file in 3m36s, and decompress it back to the same 10GiB sparse file in 16s. I used the default single-thread mode here; it also works (and is a bit faster) in multi-thread mode.

% xz --version
xz (XZ Utils) 5.2.5
liblzma 5.2.5

% xz file.img 
% du -h --apparent-size file.img.xz
6,5M    file.img.xz
% du -h file.img.xz
6,5M    file.img.xz
% sha1sum file.img.xz
685d2fe4cd19a02eb4a17f77f9a89decf6c59b73  file.img.xz

% unxz file.img.xz 
% du -h --apparent-size file.img         
10G file.img
% du -h file.img  
5,0M    file.img
% sha1sum file.img
eb8104d1c1f8ac9dd502f7010f1625b283a8e423  file.img

zstd can do the same, but slightly better and a lot faster. It compressed the sparse file to a 5.4MiB non-sparse file in 4s, and decompressed it back to the same 10GiB sparse file in 2s.

% zstd --version
*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***

% zstd --rm file.img
file.img             :  0.05%   (  10.0 GiB =>   5.32 MiB, file.img.zst)
% du -h --apparent-size file.img.zst
5,4M    file.img.zst
% du -h file.img.zst
5,4M    file.img.zst
% sha1sum file.img.zst 
b1dda0c1f83bdfbf2094f1d39810edb379602cb3  file.img.zst

% unzstd --rm file.img.zst
file.img.zst        : 10737418240 bytes                                        
% du -h --apparent-size file.img
10G file.img
% du -h file.img
5,0M    file.img
% sha1sum file.img
eb8104d1c1f8ac9dd502f7010f1625b283a8e423  file.img
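Regarding the multi-thread mode mentioned above: both tools accept -T0 to choose the thread count automatically (a minimal sketch, assuming xz >= 5.2 and a recent zstd):

% xz -T0 file.img          # multi-threaded xz compression
% zstd -T0 --rm file.img   # multi-threaded zstd compression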
Corubba
-3

You're definitely looking for a compression tool such as tar, lzma, bzip2, zip or rar. According to this site, lzma is quite fast while still having quite a good compression ratio:

http://blog.terzza.com/linux-compression-comparison-gzip-vs-bzip2-vs-lzma-vs-zip-vs-compress/

You can also adjust the speed/quality trade-off of the compression by setting the compression level to something low; experiment a bit to find a level that works best:

http://linux.die.net/man/1/unlzma
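For instance, with xz/lzma the numeric presets 1-9 trade speed against ratio (the file name is just an example):

$ xz -1 file.img   # fast preset, lower compression ratio
$ xz -9 file.img   # slow preset, best compression ratio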

LukeGT
  • Compression by gzip seems worse than simply archiving the file using tar. Please find the updated question with the results of gzip and tar. Archiving seems still too slow for handling a file that contains 0 byte. – ericzma Nov 07 '12 at 08:35
  • Bzip has the slowest decompression speed among gz and LZMA for all compression ratios – staticd Oct 07 '13 at 07:22
  • Thanks @staticd, I misread the graph. I removed that recommendation from my answer. – LukeGT Feb 18 '17 at 03:05