5

I have the following functions to generate md5-hashes for files. The functions work great for small files, but crashes and generate Run-time error 7 - Out of memory when I try to hash files over ~250 MB (I don't actually know at which exact size it breaks, but files below 200 MB work fine).

I don't understand why it breaks at a certain size, so if anyone could shed some light on that I would appreciate it a lot.

Also, is there anything I can do to make the functions handle larger files? I intend to use the functions in a larger tool where I will need to generate hashes for files of unknown sizes. Most will be small enough for the current functions to work, but I will have to be able to handle large files as well.

I got my current functions from the most upvoted answer this post How to get the MD5 hex hash for a file using VBA?

Public Function FileToMD5Hex(ByVal strFileName As String) As String
Dim varEnc           As Variant
Dim varBytes         As Variant
Dim strOut           As String
Dim intPos           As Integer

Set varEnc = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")

'Convert the string to a byte array and hash it
varBytes = GetFileBytes(strFileName)
varBytes = varEnc.ComputeHash_2((varBytes))

'Convert the byte array to a hex string
For intPos = 1 To LenB(varBytes)
   strOut = strOut & LCase(Right("0" & Hex(AscB(MidB(varBytes, intPos, 1))), 2))
Next

FileToMD5Hex = strOut

Set varEnc = Nothing

End Function

Private Function GetFileBytes(ByVal strPath As String) As Byte()
Dim lngFileNum          As Long
Dim bytRtnVal()         As Byte

lngFileNum = FreeFile

'If file exists, get number of bytes
If LenB(Dir(strPath)) Then
   Open strPath For Binary Access Read As lngFileNum
   ReDim bytRtnVal(LOF(lngFileNum)) As Byte
   Get lngFileNum, , bytRtnVal
   Close lngFileNum
Else
   MsgBox "Filen finns inte" & vbCrLf & "Avbryter", vbCritical, "Filen hittades inte"
   Exit Function
End If

GetFileBytes = bytRtnVal
Erase bytRtnVal

End Function

Thank you

Community
  • 1
  • 1
mejchei
  • 63
  • 1
  • 3
  • 1
    have you tried changing `intPos` to a `Long` instead? Integers are 16-bit signed in VBA and so are limited to 32,767. It's a long-shot as I would expect this to cause an `overflow` error if it were the problem but worth a try nontheless – SierraOscar Mar 31 '16 at 09:21
  • @Macro Man Thank you for your suggestion. The problem occurs before intPos comes into play. The error appears at `GetFileBytes = bytRtnVal` in `Private Function GetFileBytes(ByVal strPath As String) As Byte()`. – mejchei Mar 31 '16 at 10:24

2 Answers2

7

It looks like you reached the memory limit. A better way would be to compute the MD5 of the file by block:

Public Function ComputeMD5(filepath As String) As String
  Dim buffer() As Byte, svc As Object, hFile%, blockSize&, i&
  blockSize = 2 ^ 16

  ' open the file '

  If Len(Dir(filepath)) Then Else Err.Raise 5, , "file not found" & vbCr & filepath

  hFile = FreeFile
  Open filepath For Binary Access Read As hFile

  ' allocate buffer '

  If LOF(hFile) < blockSize Then blockSize = ((LOF(hFile) + 1024) \ 1024) * 1024
  ReDim buffer(0 To blockSize - 1)

  ' compute hash '

  Set svc = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")

  For i = 1 To LOF(hFile) \ blockSize
    Get hFile, , buffer
    svc.TransformBlock buffer, 0, blockSize, buffer, 0
  Next

  Get hFile, , buffer
  svc.TransformFinalBlock buffer, 0, LOF(hFile) Mod blockSize
  buffer = svc.Hash

  ' cleanup '

  svc.Clear
  Close hFile

  ' convert to an hexa string '

  ComputeMD5 = String$(32, "0")

  For i = 0 To 15
     Mid$(ComputeMD5, i + i + 2 + (buffer(i) > 15)) = Hex(buffer(i))
  Next

End Function
Florent B.
  • 41,537
  • 7
  • 86
  • 101
  • @ Florent B. Wow, thanks! That worked like charm! Out of curiosity, where do the values for the upper bound of the array block() come from? What do they mean? – mejchei Mar 31 '16 at 11:39
  • The Open file from excel has a max buffer of 32,767 bytes (1024 * 32 -1), so I choose a bit less: 31744 bytes (1024 * 31). – Florent B. Mar 31 '16 at 11:52
  • @ Florent B. I see. Thanks again! :) – mejchei Mar 31 '16 at 12:01
  • 1
    Thank you, I am very pleased to have found working code a block-by-block MD5 file loader! A quick tip: you don't need to specify `Len = 31744` in your `Open For Binary Access Read' statement, and a 1MByte block size is better on a slow network (fewer 'hits' to the file). It turns out that the `Len` parameter is the only part of the `Open` statement with a Small Int limit of 32767 :) – Nigel Heffernan Dec 29 '17 at 12:47
  • 1
    Another small point: you don't need to use `FileLen(filepath)` - it's just one more 'hit' to the file, with all the associated network and disk lag; immediately after the `Open` statement, `Length = LOF(#1)` can read the byte count behind #1, the open file handle. – Nigel Heffernan Dec 29 '17 at 12:52
  • Can I ask why you've opted to use the old-school type suffixes on hFile%, blockSize& and i& instead of declaring the types explicitly as any other in VBA? – FrugalTPH Apr 04 '18 at 13:19
  • @FrugalTPH, I just find it easier to read once used to it. I wouldn't consider it old school since it's still well documented [Type characters](https://learn.microsoft.com/en-us/dotnet/visual-basic/programming-guide/language-features/data-types/type-characters) and sometimes necessary to avoid and overflow (i.e `... * 256&`). Though it's just my personal opinion. – Florent B. Apr 04 '18 at 13:40
  • Any clue on how to get this working for files greater than 2GB? – FrugalTPH Apr 30 '18 at 18:05
  • @FrugalTPH, `LOF` (length of the file) returns a `Long` which can hold a maximum of `2147483647` bytes. Thus if you want to get it working with a file over 2gb, you'll have to find another way to get the file size and to store it in a Double or Currency. See http://www.vbforums.com/showthread.php?531321-VB6-Huge-(-gt-2GB)-File-I-O-Class – Florent B. Apr 30 '18 at 18:25
  • @FlorentB. I've allowed for that but getting incorrect MD5's out of the function as a result. I've just posted a follow-up answer below explaining further. – FrugalTPH May 02 '18 at 09:40
0

This is an extension to FlorentB's answer, which worked brilliantly for me until my files surpassed the 2GB LOF() size limit.

I tried to adapt for getting file length by alternate means as follows:

Public Function ComputeMD5(filepath As String) As String
    If Len(Dir(filepath)) Then Else Err.Raise 5, , "File not found." & vbCr & filepath

    Dim blockSize As Long: blockSize = 2 ^ 20
    Dim blockSize_f As Double
    Dim buffer() As Byte
    Dim fileLength As Variant
    Dim hFile As Integer
    Dim n_Reads As Long
    Dim i As Long
    Dim svc As Object: Set svc = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")

    fileLength = DecGetFileSize(filepath)
    If fileLength < blockSize Then blockSize = ((fileLength + 1024) \ 1024) * 1024
    ReDim buffer(0 To blockSize - 1)
    n_Reads = fileLength / blockSize
    blockSize_f = fileLength - (CDbl(blockSize) * n_Reads)

    hFile = FreeFile
    Open filepath For Binary Access Read As hFile
    For i = 1 To n_Reads
        Get hFile, i, buffer
        svc.TransformBlock buffer, 0, blockSize, buffer, 0
    Next i

    Get hFile, i, buffer
    svc.TransformFinalBlock buffer, 0, blockSize_f
    buffer = svc.Hash
    svc.Clear
    Close hFile

    ComputeMD5 = String$(32, "0")
    For i = 0 To 15
        Mid$(ComputeMD5, i + i + 2 + (buffer(i) > 15)) = Hex(buffer(i))
    Next

End Function

Public Function DecGetFileSize(fname As String) As Variant
    Dim fso As New FileSystemObject
    Dim f: Set f = fso.GetFile(fname)
    DecGetFileSize = CDec(f.Size)
    Set f = Nothing
    Set fso = Nothing
End Function

This all runs fine, returning a string, however that string does not equal the MD5 calculated using other tools on the same file.

I can't work out where the discrepancy is originating.

I've checked and double checked filelength, n_reads, blockSize and blockSize_f and I'm sure those values are all correct.

I had some trouble with the Get function, where if I didn't explicitly tell it the block number, it dies at block 2048.

Any ideas / pointers would be much appreciated.

FrugalTPH
  • 533
  • 2
  • 6
  • 25
  • 1
    The issue is likely due to the implicit rounding. Declare `fileLength` as a Double and replace with `n_Reads = Fix(fileLength / blockSize)` and `fileLength - Fix(CDbl(blockSize) * n_Reads)` – Florent B. May 02 '18 at 12:27
  • @FlorentB. Thanks for that, but still not working. I believe it is definitely something to do with the Get function overflowing at the 2048th read. With my block size of 2^20 multiplied by 2048 = 2147483648, which is the limit on Long datatype. – FrugalTPH May 03 '18 at 12:05
  • I can't seem to find a way round it using Get, have seen suggestions to use FileSystemObject read line (which I don't think will work with Binary), and also seen that reading as a ADODB binary stream could be an option. I'm thinking the latter seems the best option, but am now struggling with chunking the data through the ADODB stream reader. – FrugalTPH May 03 '18 at 12:07