I want to calculate the SHA-256 hash of a file that is larger than 1 MB. To get this hash with the mbedtls library, I would need to copy the whole file into memory, but I only have 100 KB of memory. Is there a way to calculate the file's hash in sections?

Jade

1 Answer

"In order to get this hash value with mbedtls library, I need to copy the whole file to the memory."

This is not accurate. The mbedtls library supports incremental calculation of hash values.

To calculate a SHA-256 hash with mbedtls, you would have to take the following steps (reference):

  • Create an instance of the mbedtls_sha256_context struct.
  • Initialize the context with mbedtls_sha256_init and then mbedtls_sha256_starts_ret.
  • Feed data into the hash function with mbedtls_sha256_update_ret.
  • Calculate the final hash sum with mbedtls_sha256_finish_ret.
  • Free the context with mbedtls_sha256_free.

Note that this does not mean that the mbedtls_sha256_context struct holds the entire data until mbedtls_sha256_finish_ret is called. Instead, mbedtls_sha256_context only holds the intermediate result of the hash calculation. When feeding additional data into the hash function with mbedtls_sha256_update_ret, the state of the calculation is updated and the new intermediate result is stored in the mbedtls_sha256_context.

The total size of an mbedtls_sha256_context, as determined by sizeof(mbedtls_sha256_context), is 108 bytes on my system. We can also see this from the mbedtls source code (reference):

typedef struct mbedtls_sha256_context
{
    uint32_t total[2];          /*!< The number of Bytes processed.  */
    uint32_t state[8];          /*!< The intermediate digest state.  */
    unsigned char buffer[64];   /*!< The data block being processed. */
    int is224;                  /*!< Determines which function to use:
                                     0: Use SHA-256, or 1: Use SHA-224. */
}
mbedtls_sha256_context;

We can see that the struct holds a counter of 2*32 bits = 8 bytes that keeps track of the total number of bytes processed so far. Another 8*32 bits = 32 bytes hold the intermediate result of the hash calculation, and 64 bytes hold the data block currently being processed. This is a fixed-size buffer that does not grow with the amount of data being hashed. Finally, an int distinguishes between SHA-224 and SHA-256; on my system, sizeof(int) == 4. In total, this gives 8 + 32 + 64 + 4 = 108 bytes.

Consider the following example program, which reads a file step by step into a buffer of size 4096 and feeds the buffer into the hash function in each step:

#include <mbedtls/sha256.h>

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFFER_SIZE 4096
#define HASH_SIZE 32

// Note: the *_ret function names belong to the Mbed TLS 2.x API.
// Mbed TLS 3.x drops the _ret suffix (e.g. mbedtls_sha256_update).

int main(void) {
  int ret = EXIT_FAILURE;
  FILE *fp = NULL;
  uint8_t buffer[BUFFER_SIZE];
  uint8_t hash[HASH_SIZE];
  size_t read;

  // Initialize hash
  mbedtls_sha256_context ctx;
  mbedtls_sha256_init(&ctx);
  if (mbedtls_sha256_starts_ret(&ctx, /*is224=*/0) != 0) {
    goto exit;
  }

  // Open file in binary mode so that no newline translation takes place
  fp = fopen("large_file", "rb");
  if (fp == NULL) {
    goto exit;
  }

  // Read file in chunks of size BUFFER_SIZE
  while ((read = fread(buffer, 1, BUFFER_SIZE, fp)) > 0) {
    if (mbedtls_sha256_update_ret(&ctx, buffer, read) != 0) {
      goto exit;
    }
  }
  // fread returns 0 on both end-of-file and error, so tell them apart
  if (ferror(fp)) {
    goto exit;
  }

  // Calculate final hash sum
  if (mbedtls_sha256_finish_ret(&ctx, hash) != 0) {
    goto exit;
  }

  // Simple debug printing. Use MBEDTLS_SSL_DEBUG_BUF in a real program.
  for (size_t i = 0; i < HASH_SIZE; i++) {
    printf("%02x", hash[i]);
  }
  printf("\n");
  ret = EXIT_SUCCESS;

exit:
  // Cleanup
  if (fp != NULL) {
    fclose(fp);
  }
  mbedtls_sha256_free(&ctx);
  return ret;
}

When running the program on a large sample file, the following behavior can be observed:

$ dd if=/dev/random of=large_file bs=1024 count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 5.78353 s, 177 MB/s
$ sha256sum large_file 
ae2d3b46eec018e006533da47a80e933a741a8b1320cfce7392a5472faae0216  large_file
$ gcc -O3 -static test.c /usr/lib/libmbedcrypto.a
$ ./a.out 
ae2d3b46eec018e006533da47a80e933a741a8b1320cfce7392a5472faae0216

We can see that the program calculates the correct SHA-256 hash. We can also inspect the memory used by the program:

$ command time -v ./a.out
...
Maximum resident set size (kbytes): 824
...

We can see that the program consumed at most 824 KB of memory. Thus, we have calculated the hash of a 1 GB file with less than 1 MB of memory. This shows that we do not have to load the entire file into memory at once to calculate its hash with mbedtls.

Keep in mind that this measurement was done on a 64-bit desktop computer, not an embedded platform. Also, no optimizations were performed beyond -O3 and static linking (the latter approximately halved the memory usage of the program). I would expect the memory footprint to be even smaller on an embedded device with a smaller address size and a toolchain that performs further optimizations.