If I create a file, use lseek(2) to jump to a high position in the (empty) file, and then write some valuable information there, I create a sparse file on a Unix system. (This probably depends on the file system, but let's assume a typical Unix file system like ext4 or similar, where this is the case.)

If I then lseek(2) to an even higher position in the file and write something there as well, I end up with a sparse file that contains the valuable information somewhere in its middle, surrounded by huge holes. I'd like to find this valuable information within the file without having to read the file completely.

Example:

$ python
f = open('sparse', 'w')
f.seek((1<<40) + 42)
f.write('foo')
f.seek((1<<40) * 2)
f.write('\0')
f.close()

This will create a 2TB file which uses only 8k of disk space:

$ du -h sparse 
8.0K    sparse
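
The gap between the apparent size and the disk usage can also be checked from Python. This is only a minimal sketch: the helper name `apparent_and_allocated` is made up for illustration, and it assumes that `st_blocks` counts 512-byte units, which is what Linux reports but is not guaranteed everywhere.

```python
import os

def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file.

    Assumption: st_blocks is expressed in 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

For a sparse file the second value should be much smaller than the first; for the file above, the first value would be 2 TiB plus one byte, while the second should roughly match what du reports.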

Somewhere in the middle of it (at 1TB + 42 bytes) is the valuable information (foo).

I can find it using cat sparse of course, but that reads the complete file and prints immense amounts of zero bytes. Extrapolating from tests with smaller sizes, this method would take about 3 hours to print the three characters on my computer.

The question is:

Is there a way to find the information stored in a sparse file without reading all the empty blocks as well? Can I somehow find out where empty blocks are in a sparse file using standard Unix methods?

Alfe
  • It's called a sparse file, not a spare file. – GIZ Sep 07 '17 at 20:15
  • @direprobs Right, thanks. Someone fixed that misspelling already, thanks to that guy, too. I should have become more suspicious when the tag wasn't known already. – Alfe Sep 08 '17 at 08:55
  • The problem with a sparse file is that the filesystem produces the empty bytes at runtime, and they are assumed to be data in the file as well. Even if you use `lseek(2)` with `SEEK_DATA` this doesn't work: when you seek to the actual data, the filesystem treats the zeros as actual data, although they're just zeros from our perspective. – GIZ Sep 08 '17 at 09:21
  • Yes. The question is, is there any way of figuring out where the sparse parts and where the non-sparse parts of the file are? Some low-level sparse file support? – Alfe Sep 08 '17 at 09:33
  • Found SEEK_HOLE in https://stackoverflow.com/questions/13252682/copying-a-1tb-sparse-file . This seems to be capable of handling sparse files effectively. – Alfe Sep 08 '17 at 12:10
  • See line 40 in `lseek(2)`. I'm not aware of a trick to cancel the sparse zeros to get only `foo` in the middle. Unfortunately, your question didn't receive the attention it deserves yet. – GIZ Sep 08 '17 at 12:45
  • I now found a "kind of" way by using `bsdtar cf - sparse` which creates a tar and prints it to stdout. The tar contains more or less human readable information about the sparseness of the file and the actual information. If nothing more decent shows up, I will probably post my own answer using this. – Alfe Sep 08 '17 at 13:23

1 Answer

Just writing an answer based on the previous comments:

#!/usr/bin/env python3
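from errno import ENXIO
# SEEK_DATA and SEEK_HOLE are exposed by the os module since Python 3.3
# on platforms that support them, so there is no need to hard-code 3 and 4.
from os import lseek, SEEK_DATA, SEEK_HOLE
from sys import argv, stderr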

def get_ranges(fobj):
    ranges = []
    end = 0

    while True:
        try:
            start = lseek(fobj.fileno(), end, SEEK_DATA)
            end = lseek(fobj.fileno(), start, SEEK_HOLE)
            ranges.append((start, end))
        except OSError as e:
            if e.errno == ENXIO:
                return ranges

            raise

def main():
    if len(argv) < 2:
        print('Usage: %s <sparse_file>' % argv[0], file=stderr)
        raise SystemExit(1)

    try:
        with open(argv[1], 'rb') as f:
            ranges = get_ranges(f)
            for start, end in ranges:
                print('[%d:%d]' % (start, end))
                size = end-start
                length = min(20, size)
                f.seek(start)
                data = f.read(length)
                print(data)
    except OSError as e:
        print('Error:', e, file=stderr)
        raise SystemExit(1)

if __name__ == '__main__': main()

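It probably doesn't do exactly what you want, however, which is to return precisely the data you wrote. File systems typically track holes at block granularity, so each reported data range is usually block-aligned: zeroes may surround the data you wrote and must be trimmed by hand. Note also that the script prints only the first 20 bytes of each range, so data that starts further into a block (like foo at offset 42 in the question's example) will likely not show up in the preview.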
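
The manual trimming could be sketched roughly as follows. This is a hedged illustration, not a definitive solution: the function names `data_ranges` and `trimmed_payloads` are made up here, the code assumes each data range is small enough to read into memory in one go, and it cannot tell padding zeroes from zero bytes that were deliberately written at the edge of a range (a range consisting only of zeroes, like the single `\0` at the end of the question's file, is dropped).

```python
import errno
import os

def data_ranges(fd):
    """Yield (start, end) for each region the file system reports as data."""
    end = 0
    while True:
        try:
            start = os.lseek(fd, end, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # no data at or after `end`
                return
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end

def trimmed_payloads(path):
    """Return [(offset, data)] with leading and trailing NUL bytes removed."""
    found = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for start, end in data_ranges(fd):
            # Assumption: the whole range fits in memory and is read at once.
            block = os.pread(fd, end - start, start)
            body = block.lstrip(b'\0')
            offset = start + (len(block) - len(body))
            body = body.rstrip(b'\0')
            if body:  # skip ranges that contain nothing but zeroes
                found.append((offset, body))
    finally:
        os.close(fd)
    return found
```

Calling `trimmed_payloads('sparse')` returns a list of (offset, bytes) pairs, one for each data range that contains something other than zero bytes. Zero bytes in the interior of a range are kept as they are.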

The current status of SEEK_DATA and SEEK_HOLE is described in https://man7.org/linux/man-pages/man2/lseek.2.html:

SEEK_DATA and SEEK_HOLE are nonstandard extensions also present in Solaris, FreeBSD, and DragonFly BSD; they are proposed for inclusion in the next POSIX revision (Issue 8).

hdante
  • It's based on the existence of the flags but I guess that's the closest we will get. And in the future it probably will be standard. – Alfe Jun 11 '20 at 00:09