21

How do I get a short hash of a long string using Excel VBA?

What's given

  • Input string is not longer than 80 characters
  • Valid input characters are: [0..9] [A_Z] . _ /
  • Valid output characters are [0..9] [A_Z] [a_z] (lower and upper case can be used)
  • The output hash shouldn't be longer than ~12 characters (shorter is even better)
  • No need to be unique at all since this will result in a hash that's too long

What I have done so far

I thought this SO answer is a good start since it generates a 4-digit Hex-Code (CRC16).

But 4 digits were too few. In my test with 400 strings, 20% got a duplicate somewhere else.
The chance to generate a collision is too high.

Sub tester()
    For i = 2 To 433
        Cells(i, 2) = CRC16(Cells(i, 1))
    Next i
End Sub


Function CRC16(txt As String)
Dim x As Long
Dim mask, i, j, nC, Crc As Integer
Dim c As String

Crc = &HFFFF

For nC = 1 To Len(txt)
    j = Val("&H" + Mid(txt, nC, 2))
    Crc = Crc Xor j
    For j = 1 To 8
        mask = 0
        If Crc / 2 <> Int(Crc / 2) Then mask = &HA001
        Crc = Int(Crc / 2) And &H7FFF: Crc = Crc Xor mask
    Next j
Next nC

CRC16 = Hex$(Crc)
End Function

How to reproduce

You can copy these 400 test strings from pastebin.
Paste them to column A in a new Excel workbook and execute the code above.

Q: How do I get a string hash which is short enough (12 chars) and long enough to get a small percentage of duplicates.

Melanie Palen
  • 2,645
  • 6
  • 31
  • 50
nixda
  • 2,654
  • 12
  • 49
  • 82

5 Answers5

43

Maybe others will find this useful.

I have collected some different functions to generate a short hash of a string in VBA.
I don't take credit for the code and all sources are referenced.

enter image description here

  1. CRC16
    • Function: =CRC16HASH(A1) with this Code
    • hash is a 4 characters long HEX string
    • 19 code lines
    • 4 digits long hash = 624 collisions in 6895 lines = 9 % collision rate
  2. CRC16 numeric
    • Function: =CRC16NUMERIC(A1) with this Code
    • hash is a 5 digits long number
    • 92 code lines
    • 5 digits long hash = 616 collisions in 6895 lines = 8.9 % collision rate
  3. CRC16 twice
    • Function: =CRC16TWICE(A1) with this Code
    • hash is a 8 characters long HEX string
    • hash can be expanded to 12/16/20 etc. characters to reduce collision rate even more
    • 39 code lines
    • 8 digits long hash = 18 collisions in 6895 lines = 0.23 % collision rate
  4. SHA1
    • Function: =SHA1TRUNC(A1) with this Code
    • hash is a 40 characters long HEX string
    • 142 code lines
    • can be truncated
    • 4 digits hash = 726 collisions in 6895 lines = 10.5 % collision rate
    • 5 digits hash = 51 collisions in 6895 lines = 0.73 % collision rate
    • 6 digits hash = 0 collisions in 6895 lines = 0 % collision rate
  5. SHA1 + Base64
    • Function: =BASE64SHA1(A1) with this Code
    • hash is a 28 characters long unicode string (case sensitive + special chars)
    • 41 code lines
    • requires .NET since it uses library "Microsoft MSXML"
    • can be truncated
    • 4 digits hash = 36 collisions in 6895 lines = 0.5 % collision rate
    • 5 digits hash = 0 collisions in 6895 lines = 0 % collision rate

Here is my test workbook with all example functions and a big number of test strings.

Feel free to add own functions.

nixda
  • 2,654
  • 12
  • 49
  • 82
  • 2
    Thanks for the test worksheet -- nicely done! (For those perhaps worried about infections macros -- for what it's worth, I've downloaded the test worksheet, scanned it, used it, no problems.) – Assad Ebrahim Jun 13 '13 at 14:57
  • 1
    Your workbook is unavailable. Clicking the link yields "These files are no longer available as their owner was violating our terms of service." – Thomas Cox Nov 07 '17 at 03:32
  • @ThomasCox Fixed it, link is available again – nixda Jan 21 '19 at 07:54
  • 2
    Your link shortener appears broken. Test workbook cannot be found using the link. – ftrotter Apr 09 '21 at 10:38
  • 2
    @nixda can you fix the link again please? – Karls Apr 15 '21 at 06:44
16

Split your string into three shorter strings (if not divisible by three, the last one will be longer than the other two). Run your "short" algorithm on each, and concatenate the results.

I could write the code but based on the quality of the question I think you can take it from here!

EDIT: It turns out that that advice is not enough. There is a serious flaw in your original CRC16 code - namely the line that says:

j = Val("&H" + Mid(txt, nC, 2))

This only handles text that can be interpreted as hex values: lowercase and uppercase letters are the same, and anything after F in the alphabet is ignored (as far as I can tell). That anything good comes out at all is a miracle. If you replace the line with

j = asc(mid(txt, nC, 1))

Things work better - every ASCII code at least starts out life as its own value.

Combining this change with the proposal I made earlier, you get the following code:

Function hash12(s As String)
' create a 12 character hash from string s

Dim l As Integer, l3 As Integer
Dim s1 As String, s2 As String, s3 As String

l = Len(s)
l3 = Int(l / 3)
s1 = Mid(s, 1, l3)      ' first part
s2 = Mid(s, l3 + 1, l3) ' middle part
s3 = Mid(s, 2 * l3 + 1) ' the rest of the string...

hash12 = hash4(s1) + hash4(s2) + hash4(s3)

End Function

Function hash4(txt)
' copied from the example
Dim x As Long
Dim mask, i, j, nC, crc As Integer
Dim c As String

crc = &HFFFF

For nC = 1 To Len(txt)
    j = Asc(Mid(txt, nC)) ' <<<<<<< new line of code - makes all the difference
    ' instead of j = Val("&H" + Mid(txt, nC, 2))
    crc = crc Xor j
    For j = 1 To 8
        mask = 0
        If crc / 2 <> Int(crc / 2) Then mask = &HA001
        crc = Int(crc / 2) And &H7FFF: crc = crc Xor mask
    Next j
Next nC

c = Hex$(crc)

' <<<<< new section: make sure returned string is always 4 characters long >>>>>
' pad to always have length 4:
While Len(c) < 4
  c = "0" & c
Wend

hash4 = c

End Function

You can place this code in your spreadsheet as =hash12("A2") etc. For fun, you can also use the "new, improved" hash4 algorithm, and see how they compare. I created a pivot table to count collisions - there were none for the hash12 algorithm, and only 3 for the hash4. I'm sure you can figure out how to create hash8, ... from this. The "no need to be unique" from your question suggests that maybe the "improved" hash4 is all you need.

In principle, a four character hex should have 64k unique values - so the chance of two random strings having the same hash would be 1 in 64k. When you have 400 strings, there are 400 x 399 / 2 "possible collision pairs" ~ 80k opportunities (assuming you had highly random strings). Observing three collisions in the sample dataset is therefore not an unreasonable score. As your number of strings N goes up, the probability of collisions goes as the square of N. With the extra 32 bits of information in the hash12, you expect to see collisions when N > 20 M or so (handwaving, in-my-head-math).

You can make the hash12 code a little bit more compact, obviously - and it should be easy to see how to extend it to any length.

Oh - and one last thing. If you have RC addressing enabled, using =CRC16("string") as a spreadsheet formula gives a hard-to-track #REF error... which is why I renamed it hash4

Floris
  • 45,857
  • 6
  • 70
  • 122
  • 1
    Pal, +1 for comprehensive effort and detailed explanation! Favorited for very much likely future use. – Peter L. Feb 06 '13 at 19:56
  • +1 That helps. I played the whole day with different VBA implementations (MD5, SHA1 and Base64). I will post all stand-alone functions and include your new code if you dont mind. Maybe you are interested in other solutions since it seems you put a lot of effort in it like me :) – nixda Feb 06 '13 at 20:05
  • 1
    Re-reading this nine months later, I realize that you could get a far more unique hash when you allow all the characters (which the question did...), instead of just the digits [0-9] and [A-F]. This means the total number of 4 character hashes goes from 64k to 1.6M, reducing risk of collision to something very small. To take advantage of this you would have to change the algorithm - you could modify `hash4` to generate a 6 digit hex code, then convert to 4 character "base-36" number. Probability of collisions would drop accordingly. – Floris Nov 20 '13 at 06:14
7

32 bits hash function for strings with a low level of collision:

Public Function StrHash(text As String) As Long
    Dim i As Long
    StrHash = &H65D5BAAA

    For i = 1 To Len(text)
        StrHash = ((StrHash + AscW(Mid$(text, i, 1))) Mod 69208103) * 31&
    Next
End Function

Or as a 64 bits hash function:

Public Function StrHash64(text As String) As String
    Dim i&, h1&, h2&, c&
    h1 = &H65D5BAAA
    h2 = &H2454A5ED

    For i = 1 To Len(text)
        c = AscW(Mid$(text, i, 1))
        h1 = ((h1 + c) Mod 69208103) * 31&
        h2 = ((h2 + c) Mod 65009701) * 33&
    Next

    StrHash64 = Right("00000000" & Hex(h1), 8) & Right("00000000" & Hex(h2), 8)
End Function

Based on the FNV hash algorithm

Florent B.
  • 41,537
  • 7
  • 86
  • 101
  • I tried your function with my sample workbook linked above, but it's not working because `Private Declare Sub CopyMemory Lib "kernel32"` doesn't work on x64 systems even when I insert `PtrSafe`. May I ask you to have a look at [this workbook](http://ge.tt/3vkLOzL2/v/0) and find the error? – nixda Aug 10 '15 at 21:30
  • Ha, I find it quite the amusing coincidence that I decided to use your implementation yesterday and only 4 hours later you come back to edit your 4.5 years old answer. Also thanks for the link - I have been wondering about the hex value (initial hash). :) – Inarion Jun 14 '19 at 06:34
0

While the below is not a hash function, I've used it as a quick way to generate numeric id's that have a low collision rate over a small list (small enough to verify by inspection).

How it Works: Column A holds the strings from row 2 onward. In row 1, A1 and B1 hold an arbitrary start and end position midway in the string. The formula uses the first letter of the string and a fixed letter taken from mid-string and uses LEN() as a 'fanning function' to reduce the chance of collisions.

 =CODE(A2)*LEN(A2) + CODE(MID(A2,$A$1,$B$1))*LEN(MID(A2,$A$1,$B$1))

If strings are pulled from a database table with fixed width fields, you may need to trim the lengths:

 =CODE(TRIM(C8))*LEN(TRIM(C8))
       +CODE(MID(TRIM(C8),$A$1,1))*LEN(MID(TRIM(C8),$A$1,$B$1))
Assad Ebrahim
  • 6,234
  • 8
  • 42
  • 68
0

In recent versions of Excel (March 2022 and later), the new array formulas make it possible to create hash functions without VBA.

Here is the formula for Bernstein's djb2 hash function (see e.g. http://www.cse.yorku.ca/~oz/hash.html):

hash_djb2 = LAMBDA(v,
    MAP(
        v,
        LAMBDA(x,
            LET(
                y, VALUETOTEXT(x, 0),
                l, LEN(y),
                REDUCE(
                    5381,
                    SEQUENCE(l),
                    LAMBDA(a, j,
                        LET(
                            z, CODE(MID(y, j, 1)),
                            MOD(a * 33 + z, 2 ^ 32)
                        )
                    )
                )
            )
        )
    )
);

The output is an integer smaller than 2^32 (~4e9). It can be further shortened to 8 characters using DEC2HEX, or to 6 characters with a Base64 implementation.

AndreA
  • 295
  • 2
  • 12