If I have a big file containing many zeros, how can I efficiently make it a sparse file?

Is the only possibility to read the whole file (including all zeros, some of which may already be stored sparsely) and rewrite it to a new file, using seek to skip the zero areas?

Or is there a way to do this in place on an existing file (e.g. File.setSparse(long start, long end))?

I'm looking for a solution in Java or some Linux commands. The filesystem will be ext3 or similar.

mgutt
rurouni
  • 4
    The first solution is implemented in 'cp --sparse=always', but that is not efficient and requires copying the file and moving afterwards. – rurouni May 13 '11 at 08:39
  • 1
    http://stackoverflow.com/questions/245251/create-file-with-given-size-in-java – joe776 May 13 '11 at 08:41
  • 2
    @joe: that is about creating a sparse file from scratch, but I want to make an existing file sparse. – rurouni May 13 '11 at 08:45
  • @rurouni Sorry, I overlooked that part. Sounds pretty tedious to do it from Java. – joe776 May 13 '11 at 08:51
  • If your current file format contains lots of zeros, can you change your file format so it doesn't need to? – Peter Lawrey May 13 '11 at 08:53
  • @joe: I assume it will be impossible in Java but I would expect a linux tool to exist if this is possible at all (which should be, because this only means to change the inode and remove block references) – rurouni May 13 '11 at 09:00
  • @peter: the file format is optimized for performance (high access speed is even more important than having it sparse) and holes may open and close in different areas over time (but often in consecutive ranges). These files take up a few terabytes and about 50-80% are zeroes. – rurouni May 13 '11 at 09:07
  • 1
    @rurouni, if the holes are large enough, perhaps it is worth breaking up the file and using the filesystem to delete/remove sections. – Peter Lawrey May 13 '11 at 09:15
  • 1
    Making a file sparse would result in those sections being fragmented if they were ever re-used. I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file. – Peter Lawrey May 13 '11 at 09:21
  • @peter: that might be a solution, sometimes I don't see the obvious solution ;-) – rurouni May 13 '11 at 09:24
  • @rurouni, I can see you would like the OS to do that for you. But I don't think there is an easy way in Java (or even C) to get the OS to do it. – Peter Lawrey May 13 '11 at 09:33

5 Answers

24

A lot's changed in 8 years.

Fallocate

fallocate -d filename can be used to punch holes in existing files. From the fallocate(1) man page:

-d, --dig-holes
  Detect and dig holes.  This makes the file sparse in-place,
  without using extra disk space.  The minimum size of the hole
  depends on filesystem I/O block size (usually 4096 bytes).
  Also, when using this option, --keep-size is implied.  If no
  range is specified by --offset and --length, then the entire
  file is analyzed for holes.

  You can think of this option as doing a "cp --sparse" and then
  renaming the destination file to the original, without the
  need for extra disk space.

  See --punch-hole for a list of supported filesystems.

(That list:)

Supported for XFS (since Linux 2.6.38), ext4 (since Linux
3.0), Btrfs (since Linux 3.7) and tmpfs (since Linux 3.5).

tmpfs being on that list is the one I find most interesting. The filesystem itself is efficient enough to only consume as much RAM as it needs to store its contents, but making the contents sparse can potentially increase that efficiency even further.
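To check whether digging holes actually freed space, you can compare a file's apparent size with the space allocated to it (roughly what `ls -l` versus `du` report). A minimal Python sketch follows; the function name `allocation_report` is made up for illustration, and it assumes a Linux-style `st_blocks` counted in 512-byte units.

```python
import os


def allocation_report(path):
    """Return (apparent_size, allocated_bytes) for the file at `path`."""
    st = os.stat(path)
    # Assumption: st_blocks is counted in 512-byte units, as on Linux,
    # independent of the filesystem's own block size.
    allocated = st.st_blocks * 512
    return st.st_size, allocated
```

If the allocated value is well below the apparent size, the file contains holes (or is compressed by the filesystem).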

GNU cp

Additionally, somewhere along the way GNU cp gained an understanding of sparse files. Quoting the cp(1) man page regarding its default mode, --sparse=auto:

sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.

But there's also --sparse=always, which activates the file-copy equivalent of what fallocate -d does in-place:

Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
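For illustration, the copy-and-skip approach from the question (conceptually what a sparse copy does) might look like the Python sketch below. It is not how GNU cp is implemented; `sparse_copy` and the 4096-byte `block_size` default are assumptions made for the example.

```python
import os


def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    zero_block = bytes(block_size)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(block_size)
            if not chunk:
                break
            if chunk == zero_block[:len(chunk)]:
                # Skip ahead without writing; the filesystem may leave a hole here.
                fout.seek(len(chunk), os.SEEK_CUR)
            else:
                fout.write(chunk)
        # If the source ended in zeros, the last seek wrote nothing,
        # so set the final length explicitly.
        fout.truncate(fin.tell())
```

This still reads every byte of the source and needs room for the copy, which is the inefficiency the question is trying to avoid.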

I've finally been able to retire my tar cpSf - SOURCE | (cd DESTDIR && tar xpSf -) one-liner, which for 20 years was my graybeard way of copying sparse files with their sparseness preserved.

FeRD
  • 2
    Thank you. Your hint for GNU cp helped me. It works fast where other tools (e.g. `rsync --sparse`) were slow. – dsteinkopf Oct 18 '19 at 02:28
4

Some filesystems on Linux / UNIX have the ability to "punch holes" into an existing file. See:

It's not very portable and not done the same way across the board; as of right now, I believe Java's IO libraries do not provide an interface for this.

If hole punching is available either via fcntl(F_FREESP) or via any other mechanism, it should be significantly faster than a copy/seek loop.
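On Linux, the hole-punching mechanism is fallocate(2) with FALLOC_FL_PUNCH_HOLE (the same call that `fallocate -d` relies on), rather than fcntl(F_FREESP). Python has no direct wrapper for it, so the sketch below goes through ctypes. It assumes 64-bit Linux and a filesystem that supports hole punching; `punch_hole`, `dig_holes` and the 4096-byte `block_size` are hypothetical names and values chosen for the example.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length); the range then reads back as zeros."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumption: off_t is 64 bits wide (true on 64-bit Linux).
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def dig_holes(path, block_size=4096):
    """Punch a hole over every full all-zero block; return the bytes punched."""
    zero_block = bytes(block_size)
    punched = 0
    with open(path, "r+b") as f:
        offset = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            if chunk == zero_block:
                punch_hole(f.fileno(), offset, block_size)
                punched += block_size
            offset += len(chunk)
    return punched
```

The file is modified in place and keeps its length. On a filesystem without hole-punching support the call fails with an OSError (typically EOPNOTSUPP). Merging adjacent zero blocks into one call would reduce the number of system calls.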

FrankH.
1

I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied.

Making a file sparse would result in those sections being fragmented if they were ever re-used. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file.
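A rough sketch of such an occupancy table, written in Python to keep it short (Java's java.util.BitSet would play the same role). The class name `PageMap` and its methods are invented for this example and are not taken from any library.

```python
class PageMap:
    """Track which fixed-size pages of a preallocated file hold live data."""

    def __init__(self, page_count):
        self.page_count = page_count
        self.bits = bytearray((page_count + 7) // 8)  # one bit per page

    def _locate(self, page):
        if not 0 <= page < self.page_count:
            raise IndexError(page)
        return divmod(page, 8)  # (byte index, bit index)

    def mark(self, page, occupied=True):
        byte, bit = self._locate(page)
        if occupied:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_occupied(self, page):
        byte, bit = self._locate(page)
        return bool(self.bits[byte] & (1 << bit))

    def first_free(self):
        """Return the lowest free page number, or None if every page is used."""
        for page in range(self.page_count):
            if not self.is_occupied(page):
                return page
        return None
```

Freed pages are reused by clearing their bit instead of being returned to the filesystem, so the file's on-disk layout stays contiguous at the cost of keeping the space allocated.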

Peter Lawrey
1

You can use $ truncate -s filename filesize in a Linux terminal to create a sparse file that has only metadata.

Note: filesize is in bytes.
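The same effect can be had from a program by setting the file length without writing any data. A minimal Python sketch, with `create_sparse_file` as a made-up helper name:

```python
def create_sparse_file(path, size):
    """Create the file if needed and set its length to `size` bytes without writing data."""
    # "ab" creates a missing file but does not empty an existing one.
    with open(path, "ab") as f:
        f.truncate(size)
```

As with the truncate command, this only produces a hole at the end of the file (and discards data if `size` is smaller than the current length); it does not turn existing zero-filled blocks into holes.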

Anil Arya
  • 4
    Two problems here: (1) Your arguments are backwards, it should be `truncate -s size filename`. (_size_ can actually be in any specified units, e.g. `10K` = 10240 bytes, `2MB` = 2000000 bytes). (2) The question asks about making an _existing_ file sparse, whereas this will only create a new sparse file (or extend an existing file with a sparse region at the end). – FeRD Jan 29 '19 at 11:09
0

According to this article, it seems there is currently no easy solution, except for using FIEMAP ioctl. However, I don't know how you can make "non sparse" zero blocks into "sparse" ones.
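FIEMAP itself needs a raw ioctl, but a simpler way to see which parts of a file are data and which are holes is lseek with SEEK_DATA and SEEK_HOLE. The Python sketch below uses that technique instead of FIEMAP; `data_extents` is a hypothetical helper name. It only reports the layout and does not make anything sparse.

```python
import errno
import os


def data_extents(path):
    """Return a list of (start, end) byte ranges the filesystem reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole between offset and end of file
                raise
            # The end of the file counts as a hole, so this always succeeds.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

On filesystems without hole tracking, the whole file is typically reported as a single data extent.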

shodanex