8

I just found out about the AWS Glacier service and wanted to write a small Python application to upload files via the REST API. I took a look at the required headers and stumbled upon x-amz-sha256-tree-hash. I need to calculate the SHA-256 hash of the entire file as well as the hash of the parent of all hashes of each 1 MB chunk. This leads to the following tree:

AWS's SHA-256 Tree Hash procedure

(Image taken from here)

I already made a function that reads 1 MB chunks and a class which calculates their hashes on-the-fly but then I completely struggle:

In my application I made a class called chunk which takes data and calculates a hash in the __init__ method as well as holds parent and children (like a regular tree). When the user opens a file those chunks instances will be generated properly with their respective hashes (in this example that would be 7 chunk instances).

Now I have two big problems that are connected to each other:

  1. How do I build this tree in reverse? I basically need to make a new chunk for each two chunk instances on the lowest layer and calculate a hash based on those two hashes. But where do I store that parent? In the children of the parent and do reverse tree walking?
  2. How does that work with an odd number of children? If I have an algorithm that goes through each parent layer then I would miss the last (0.5 MB) chunk.

I checked out this topic on SO but that method only works with an even children count which is not always given.

Can you help me finding a way/an algorithm/an approach on how to solve this issue?

Thanks in advance!

Paul

Community
  • 1
  • 1
Paul
  • 1,295
  • 2
  • 11
  • 26
  • How do you decide which is the parent and which is the child ? – GodMan Aug 21 '12 at 15:24
  • Did you already try to hash the archive normally with hashlib? – unddoch Aug 21 '12 at 15:26
  • @GodMan What do you exactly mean? The bottom layer are the children, the hashes built from 2 chunks are the parent of those 2 chunks and so on. – Paul Aug 21 '12 at 15:27
  • @pythonm Yes, but that is only 1/2 of the required headers. I also need this SHA-256 tree in order to successfully upload files. See [here](http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-common-request-headers.html) for the `x-amz-content-sha256` and the `x-amz-sha256-tree-hash`. – Paul Aug 21 '12 at 15:28
  • So, my question is whether you already have the tree built ? or you have just the hashes and you want help in building the tree using those hashes? – GodMan Aug 21 '12 at 15:29
  • @GodMan I just have the hashes. From my knowledge point I am unable to think of a way how to build that according to the given procedure with my problems in mind. (Also I only have the hashes for the chunks, I could not go farther because I am unable to build the second layer properly). – Paul Aug 21 '12 at 15:30
  • ok. I am unaware of AWS, but, dealing this problem as a algorithm question. I have 1 more doubt. If you have the hashes, how do you decide which hash becomes the parent and which becomes the child? – GodMan Aug 21 '12 at 15:32
  • Good question. I just set that in my mind. From what I learned is that you have a parent with children following, so the lowest layer has to contain the the children, right? However, switching the logic you could say that the hash I want to achieve could be the lowest children and the chunks are all parents (kind of a masterpiece family, many parents: one final product). – Paul Aug 21 '12 at 15:35
  • I am trying to ask : given a child, how do you decide the child-parent relationship ? – GodMan Aug 21 '12 at 15:39
  • Given the bottom left child (in the example): It would be the child of the hash in the level above. That hash is the parent of the bottom left child and a child of the hash above of it and so on. – Paul Aug 21 '12 at 15:49
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/15613/discussion-between-godman-and-paul-engstler) – GodMan Aug 21 '12 at 15:59
  • You can also find some code for doing this in the BOTO git repository, in the glacier branch. Here's a link to the latest version of this code as of the time of writing: https://github.com/boto/boto/blob/glacier/boto/glacier/writer.py – Aaron Sep 04 '12 at 21:26
  • Now in `botocore.utils.calculate_tree_hash` – wim Apr 13 '17 at 18:12

1 Answers1

4

First calculate the number of levels, then

def proclevel(levels):
    if levels > 0:
        generator = proclevel(levels - 1)
        temp = None
        for firsthash, secondhash in generator:
            if not temp: temp = hashofthem(firsthash, secondhash)
            else: yield temp, hashofthem(firsthash, secondhash); temp = None
        #If odd number of packets
        if temp: yield temp, None
    else:
        temp = None
        for chunk in chunks:
            if not temp: temp = hash(chunk)
            else: yield temp, hash(chunk); temp = None
        if temp: yield temp, None

Make sure to handle None as second argument in hashofthem :)

unddoch
  • 5,790
  • 1
  • 24
  • 37
  • Ah, recursion, of course *facepalm*. I'm going to try that out and will - hopefully - report success. ;) – Paul Aug 21 '12 at 15:38
  • This binary tree ASKS for recursion ;) – unddoch Aug 21 '12 at 15:40
  • It seems to have worked! Thank you very much! *you should have made the hint that the second argument could be `None` really, really bold - took me a while to figure out what went wrong ;)* – Paul Aug 21 '12 at 16:15