There is a backup program, bup (https://github.com/bup/bup), which is based on some ideas and functions from the git version control system and is meant for compact storage of virtual machine images.

bup has a bup ls subcommand which, when the -s option is passed, shows sha1-like hashes (the same number of hex digits) for the objects stored inside the backup (man bup-ls says only "-s, --hash : show hash for each file/directory."). But this sha1-like hash is not equal to the sha1sum output for the original file.

Original git computes the sha1 hash of data by prefixing the data with the string `blob NNN\0`, where NNN is the size of the object in bytes, written in decimal, according to "How does git compute file hashes?" and https://stackoverflow.com/a/28881708/

I tested the `blob NNN\0` prefix and still did not get the same sha1 sum.

What method does bup use to compute the hash sum of a file? Is it a linear sha1 or some tree-like variant such as a Merkle tree? What is the hash of a directory?

The source of bup's ls command is https://github.com/bup/bup/blob/master/lib/bup/ls.py, and the hash is simply printed in hex, but where is the hash generated?

def node_info(n, name, 
    ''' ....
    if show_hash:
        result += "%s " % n.hash.encode('hex')

Is that hash generated when the bup backup is created (when the file is placed into the backup by the bup index + bup save commands) and merely printed by bup ls, or is it recomputed on every bup ls, so that it can be used as an integrity test of the bup backup?

osgx
  • https://github.com/bup/bup/blob/master/DESIGN – torek Jul 17 '16 at 06:03
  • torek, already checked, no exact info on hash calculation (there is sha1 type, but *not equal to sha1sum* result) – osgx Jul 17 '16 at 06:03
  • bup splits file-data into many files, using the technique described there. So the "bup hash" of a file is not a git file hash, because there's no single git blob corresponding to the file or directory being bup-save-d. So, see lines 556-560. There *is* a single git *tree* object for such a file and probably that's the bup hash. In any case the integrity of the entire backup (or even one group item in it) is clearly not check-able without checking all the underlying git objects, which requires running `git fsck` or equivalent. – torek Jul 17 '16 at 06:12
  • This adds up to "generated when creating the backup and just printed out", probably. (BTW I say "probably" because I don't *know*; if I knew, I'd write an answer, instead of a speculative comment that just leverages the obvious :-) ) – torek Jul 17 '16 at 06:14
  • For future maintainers of the above mentioned DESIGN document. Please tighten up the writing style - the current informal chatty style does not give the impression that this is a project to be taken seriously. – Thorbjørn Ravn Andersen Jul 17 '16 at 06:31
  • torek, There are many hashes in bup, but my question started about full file hash in `bup ls -s`, not about commitid hash or rolling hash of file parts (blocks). – osgx Jul 17 '16 at 07:02

1 Answer

bup stores all data in a bare git repository (which by default is located at ~/.bup). Therefore bup's hash computation method exactly replicates the one used by git.
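
As a minimal sketch of that method (not bup's or git's actual code; the function name `git_object_hash` is made up for illustration), git hashes an object as SHA-1 over a header of the form `<type> <size>\0` followed by the content. That header is why the result differs from plain sha1sum of the same bytes:

```python
import hashlib


def git_object_hash(obj_type: str, content: bytes) -> str:
    """Return the hex SHA-1 of '<type> <size>' + NUL + content (a git object id)."""
    # The size is the content length in bytes, written in decimal.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


data = b"hello world\n"
# Plain SHA-1 of the bytes, i.e. what sha1sum prints.
print("plain sha1:", hashlib.sha1(data).hexdigest())
# SHA-1 with the blob header, i.e. what `git hash-object` prints for a file.
print("git blob  :", git_object_hash("blob", data))
```

The same function covers other object types by passing "tree" or "commit" as the type, provided the content is already serialized the way git expects for that type.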

However, an important difference from git is that bup may split files into chunks. If bup decides to split a file into chunks, then the file is represented in the repository as a tree rather than as a blob. In that case bup's hash of the file coincides with git's hash of the corresponding tree.

The following script demonstrates that:

bup_hash_test

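#!/bin/bash

# Create the bup repository (a bare git repository, ~/.bup by default).
bup init
BUPTEST=/tmp/bup_test

# Save $BUPTEST with bup, then compare bup's hash with git's blob hash
# and print the object that bup's hash refers to.
function test_bup_hash()
{
    bup index "$BUPTEST" &> /dev/null
    bup save -n buptest "$BUPTEST" &> /dev/null
    local buphash=$(bup ls -s "buptest/latest$BUPTEST" | cut -d' ' -f 1)
    echo "bup's hash: $buphash"
    echo "git's hash: $(git hash-object "$BUPTEST")"
    echo git --git-dir \~/.bup cat-file -p "$buphash"
    git --git-dir ~/.bup cat-file -p "$buphash"
}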

cat > $BUPTEST <<'END'
    http://pkgsrc.se/sysutils/bup
    http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
END

test_bup_hash

echo
echo

echo " -1" >> $BUPTEST

echo "After appending ' -1' line:"
test_bup_hash

echo
echo

echo "After replacing '-' with '#':"
sed -i 's/-/#/' $BUPTEST
test_bup_hash

Output:

$ ./bup_hash_test
Initialized empty Git repository in ~/.bup/
bup's hash: b52baef90c17a508115ce05680bbb91d1d7bfd8d
git's hash: b52baef90c17a508115ce05680bbb91d1d7bfd8d
git --git-dir ~/.bup cat-file -p b52baef90c17a508115ce05680bbb91d1d7bfd8d
    http://pkgsrc.se/sysutils/bup
    http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/


After appending ' -1' line:
bup's hash: c95b4a1fe1956418cb0e58e0a2c519622d8ce767
git's hash: b5bc4094328634ce6e2f4c41458514bab5f5cd7e
git --git-dir ~/.bup cat-file -p c95b4a1fe1956418cb0e58e0a2c519622d8ce767
100644 blob aa7770f6a52237f29a5d10b350fe877bf4626bd6    00
100644 blob d00491fd7e5bb6fa28c517a0bb32b8b506539d4d    61


After replacing '-' with '#':
bup's hash: cda9a69f1cbe66ff44ea6530330e51528563e32a
git's hash: cda9a69f1cbe66ff44ea6530330e51528563e32a
git --git-dir ~/.bup cat-file -p cda9a69f1cbe66ff44ea6530330e51528563e32a
    http://pkgsrc.se/sysutils/bup
    http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
 #1

As we can see, when bup's and git's hashes match, the corresponding object in the bup repository is a blob with the expected contents. When bup's and git's hashes do NOT match, the object with bup's hash is a tree. The contents of the blobs in that tree correspond to fragments of the full file:

$ git --git-dir ~/.bup cat-file -p aa7770f6a52237f29a5d10b350fe877bf4626bd6
    http://pkgsrc.se/sysutils/bup
    http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/sysutils/bup/
 -$ git --git-dir ~/.bup cat-file -p d00491fd7e5bb6fa28c517a0bb32b8b506539d4d
1
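
The hash of such a tree can also be recomputed by hand. The following is a hedged sketch, assuming git's usual tree serialization (each entry is `<mode> <name>\0` followed by the 20 raw bytes of the entry's hash, with entries sorted by name); `git_object_hash` and `tree_hash` are illustrative names, not functions from bup. If the listing above is accurate, the tree hash it prints should be the one `bup ls -s` reported for the chunked file.

```python
import hashlib


def git_object_hash(obj_type: str, content: bytes) -> str:
    """Return the hex SHA-1 of '<type> <size>' + NUL + content."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def tree_hash(entries):
    """Hash a tree made of (mode, name, hex_sha) blob entries.

    Plain name sorting is only valid when every entry is a blob;
    git orders sub-tree names as if they ended with '/'.
    """
    body = b"".join(
        f"{mode} {name}\0".encode("ascii") + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=lambda e: e[1])
    )
    return git_object_hash("tree", body)


# Entries copied from the `cat-file -p` listing of the chunked file.
chunks = [
    ("100644", "00", "aa7770f6a52237f29a5d10b350fe877bf4626bd6"),
    ("100644", "61", "d00491fd7e5bb6fa28c517a0bb32b8b506539d4d"),
]

# The second chunk holds just "1" and a newline; hash it as a blob.
print("second chunk blob:", git_object_hash("blob", b"1\n"))
# Hash of the tree that stands in for the whole file.
print("tree of chunks   :", tree_hash(chunks))
```

The entry names `00` and `61` appear to be hexadecimal byte offsets of the chunks within the file (the first chunk shown above is 97 = 0x61 bytes long), but that reading is an inference from this output, not something the bup documentation quoted here states.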
Leon
  • What I did: I put a 40GB file (a partition image) into bup in two versions and then ran sha1sum on bup-fuse to recheck. I want an easier way of checking the correctness of the backup. Currently I dd from the partition to a tmp folder (1st readout from disk), index, save (readout from tmp; bup work), mount bup-fuse, and run sha1sum on the partition and on the file in fuse (3rd readout from disk, 2nd readout from bup). What I want: a dd that computes the correct sha1 of the copy on the fly, to skip re-reading the source, and a way to get the real sum of the saved file (bup is on a slow disk too, and fuse causes high cpu load). (I know the answer to my q, it – osgx Jul 17 '16 at 09:39
  • Bup is not git! It just stole some ideas/formats from git, but it has its OWN variant of hash. It is hidden in the sources of bup, but it can be found. The "sha" variable name does not mean a true sha1; some (variable) prefix is used, like in git annex, but not the same prefix. What is the prefix? What sha variant is used for the commit-id? For a directory? Your bug is not related to the question; can you repost it directly as an issue at github bup/bup? PS: And this bug indicates that the value of the hash printed by `bup ls -s` is not documented. (What is your bup version?) – osgx Jul 17 '16 at 09:41
  • @osgx See updated answer. It contains final proof that bup fully depends on git for its hash computation. – Leon Jul 17 '16 at 20:56
  • And when they don't match, bup uses its own git.py with a different prefix scheme: `calc_hash(type, content): """Calculate some content's hash in the Git fashion.""" header = '%s %d\0' % (type, len(content)) sum = Sha1(header) sum.update(content) return sum.digest() `. Do you have any ideas about the kinds of hash used in bup (file hash, dir hash, commit_id, tree hash) and when each kind is used? What is your bup version? – osgx Jul 17 '16 at 22:33
  • @osgx `calc_hash()` doesn't use a different prefix scheme. It is exactly the same scheme as in `git`. Otherwise, why would the hash of a chunked file match the hash of the corresponding tree object in the repository? – Leon Jul 18 '16 at 06:44
  • Leon, what if there are two schemes in bup? git itself uses `Commit Hash (SHA1) = SHA1("blob" + " " + size + "\0" + content)` as shown in http://stackoverflow.com/a/28881708; bup with git.py uses https://github.com/bup/bup/blob/master/lib/bup/git.py#L208 `SHA1(type + " " + size + "\0" + content)`, where type is probably one of `_typemap[type]`: #L26 'blob':3, 'tree':2, 'commit':1, 'tag':4. Can you do the prefixed sha1 calculation as in http://stackoverflow.com/a/37264486 (in bash + `sha1sum`, not with `git hash-object`) and test it with different types? – osgx Jul 18 '16 at 07:31
  • @osgx git's hash calculation method `Commit Hash (SHA1) = SHA1("blob" + " " + size + "\0" + content)` applies **only to objects of type blob**. The general formula is `SHA1(object_type + " " + size + "\0" + content)`, and it is exactly replicated in `bup`. – Leon Jul 18 '16 at 08:13
  • @osgx git's hash computation code: https://github.com/git/git/blob/master/sha1_file.c#L2938 – Leon Jul 18 '16 at 08:39