
I am on a mission to eliminate as many allocations to the Large Object Heap (LOH) as I can in my applications. One of the biggest offenders is our code that computes the MD5 hash of a large string.

public static string MD5Hash(this string s)
{
    using (MD5CryptoServiceProvider csp = new MD5CryptoServiceProvider())
    {
        byte[] bytesToHash = Encoding.UTF8.GetBytes(s);
        byte[] hashBytes = csp.ComputeHash(bytesToHash);
        return Convert.ToBase64String(hashBytes);
    }
}

Leave aside, for the sake of the example, that the string itself is probably already in the LOH. Our goal is to prevent further allocations to that heap.

Also, the current implementation assumes UTF-8 encoding (a big assumption), but really the goal is just to get the bytes of a string.

MD5CryptoServiceProvider.ComputeHash has an overload that takes a Stream as input, so we can create a method:

public static string MD5Hash(this Stream stream)
{
    using (MD5CryptoServiceProvider csp = new MD5CryptoServiceProvider())
    {
        return Convert.ToBase64String(csp.ComputeHash(stream));
    }
}

This is promising because we don't need a byte[] for ComputeHash to work. We need a stream object that will read bytes from a string as bytes are requested by ComputeHash.

This rather controversial question provides a method for creating a byte array from a string regardless of encoding. However, we want to avoid the creation of a large byte array.

This question provides a method of creating a stream from a string by writing the string into a MemoryStream, but internally that just allocates a large byte[] as well.

Neither really does the trick.

So how can you avoid the allocation of a large byte[]? Is there a Stream class that will read from another stream (or reader) as bytes are read?

Joe Enzminger
  • How time critical is this MD5 calculation? If it isn't overly time critical, you could always just write the string to a (temporary) file, create a file stream and feed that to MD5CryptoServiceProvider. –  Feb 21 '15 at 01:35

2 Answers


If you don't care about the encoding, then one thing you can do to prevent any further buffer allocation is to use some unsafe code. That is, get a pointer to the raw bytes of the string, wrap an UnmanagedMemoryStream around it, and feed that to the MD5 computation. Note that this hashes the string's in-memory UTF-16 representation, so the result will differ from the UTF-8 based hash in the question.

So something like this:

public static string MD5Hash(this string s)
{
    using (MD5CryptoServiceProvider csp = new MD5CryptoServiceProvider())
    {
        unsafe
        {
            fixed (char* input = s)
            {
                using (var stream = new UnmanagedMemoryStream((byte*)input, sizeof(char) * s.Length))
                    return Convert.ToBase64String(csp.ComputeHash(stream)); 
            }
        }
    }
}
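
If unsafe code is not an option, or the encoding does matter, a similar effect can probably be had with the incremental TransformBlock and TransformFinalBlock methods that HashAlgorithm exposes. The following is only a sketch: the helper name MD5HashChunked, the charsPerChunk parameter and the use of MD5.Create() are assumptions made for illustration, not part of the code above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ChunkedHashing
{
    // Hypothetical helper: encodes the string a few thousand chars at a time,
    // so every buffer stays far below the LOH threshold (about 85,000 bytes).
    public static string MD5HashChunked(this string s, Encoding encoding, int charsPerChunk = 4096)
    {
        using (MD5 md5 = MD5.Create())
        {
            char[] chars = new char[charsPerChunk];
            // Sized for the worst-case encoding of one chunk.
            byte[] bytes = new byte[encoding.GetMaxByteCount(charsPerChunk)];
            // The Encoder keeps state between calls, so a surrogate pair
            // split across two chunks is still encoded as one code point.
            Encoder encoder = encoding.GetEncoder();

            for (int offset = 0; offset < s.Length; offset += charsPerChunk)
            {
                int count = Math.Min(charsPerChunk, s.Length - offset);
                s.CopyTo(offset, chars, 0, count);
                bool isLastChunk = offset + count == s.Length;
                int byteCount = encoder.GetBytes(chars, 0, count, bytes, 0, isLastChunk);
                // Feed this chunk to the hash; no output buffer is needed.
                md5.TransformBlock(bytes, 0, byteCount, null, 0);
            }

            // Finish the hash with an empty final block, then read the result.
            md5.TransformFinalBlock(bytes, 0, 0);
            return Convert.ToBase64String(md5.Hash);
        }
    }
}
```

Called as s.MD5HashChunked(Encoding.UTF8), this should give the same result as the original MD5Hash in the question. With Encoding.Unicode it should match the unsafe version above for well-formed strings on little-endian platforms.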
Alex

You can implement your own stream backed by a string.

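Note that the real work is all in Read. Stream's other abstract members (CanRead, CanSeek, CanWrite, Length, Position, Flush, Seek, SetLength and Write) can be trivial, and Write should just throw a NotSupportedException since you should not write to this stream. According to the documentation: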

When you implement a derived class of Stream, you must provide implementations for the Read and Write methods. The asynchronous methods ReadAsync, WriteAsync, and CopyToAsync use the synchronous methods Read and Write in their implementations.

You probably want to also implement ReadByte:

The default implementations of ReadByte and WriteByte create a new single-element byte array, and then call your implementations of Read and Write

Source: https://msdn.microsoft.com/pt-br/library/system.io.stream%28v=vs.110%29.aspx
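
As a rough illustration of the idea, here is a minimal sketch of such a stream. The class name StringStream, the 1024-char chunk size and the constructor signature are assumptions made for this example. It does not override ReadByte, which ComputeHash is not expected to need.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical read-only stream that encodes a string on demand,
// using only two small internal buffers.
public sealed class StringStream : Stream
{
    private readonly string _source;
    private readonly Encoder _encoder;
    private readonly char[] _chars = new char[1024];
    private readonly byte[] _pending;
    private int _charPos, _pendingPos, _pendingCount;

    public StringStream(string source, Encoding encoding)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));
        _source = source;
        _encoder = encoding.GetEncoder();
        _pending = new byte[encoding.GetMaxByteCount(_chars.Length)];
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Refill the internal byte buffer when it is drained and input remains.
        while (_pendingPos == _pendingCount && _charPos < _source.Length)
        {
            int n = Math.Min(_chars.Length, _source.Length - _charPos);
            _source.CopyTo(_charPos, _chars, 0, n);
            _charPos += n;
            bool isLastChunk = _charPos == _source.Length;
            _pendingCount = _encoder.GetBytes(_chars, 0, n, _pending, 0, isLastChunk);
            _pendingPos = 0;
        }

        int toCopy = Math.Min(count, _pendingCount - _pendingPos);
        Buffer.BlockCopy(_pending, _pendingPos, buffer, offset, toCopy);
        _pendingPos += toCopy;
        return toCopy; // 0 signals end of stream
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

It could then be passed to the Stream overload from the question, for example: using (var stream = new StringStream(s, Encoding.UTF8)) return stream.MD5Hash();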

Filipe Borges