
I am implementing a Pearson hash in order to create a lightweight dictionary structure for a C project which requires a table of file names paired with file data - I want the nice constant-time lookup property of hash tables. I'm no math expert, so I looked up good text hashes and Pearson came up, claimed to be effective and to have a good distribution. I tested my implementation and found that no matter how I vary the table size or the maximum filename length, the hash distributes poorly - for example, 18 out of 50 buckets are left empty. I trust Wikipedia not to be lying, and yes, I am aware I can just download a third-party hash table implementation, but I would dearly like to know why my version isn't working.

In the following code (a function to insert values into the table), "csString" is the filename - the string to be hashed - "cLen" is the length of the string, "pData" is a pointer to some data which is inserted into the table, and "pTable" is the table struct. The initial condition cHash = cLen - csString[0] is something I experimentally found to marginally improve uniformity. I should add that I am testing the table with entirely randomised strings (using rand() to generate ASCII values) with randomised lengths within a certain range - this makes it easy to generate and test large numbers of values. (A sketch of such a test driver is given after the code below.)

typedef struct StaticStrTable {
    unsigned int nRepeats;
    unsigned char nBuckets;
    unsigned char nMaxCollisions;

    void** pBuckets;
} StaticStrTable;

// Random permutation of 0..255 used as the Pearson lookup table.
// Declared unsigned char so values above 127 are stored without implementation-defined narrowing.
static const unsigned char cPerm256[256] = {
    227, 117, 238, 33, 25, 165, 107, 226, 132, 88, 84, 68, 217, 237, 228, 58, 52, 147, 46, 197, 191, 119, 211, 0, 218, 139, 196, 153, 170, 77, 175, 22, 193, 83, 66, 182, 151, 99, 11, 144, 104, 233, 166, 34, 177, 14, 194, 51, 30, 121, 102, 49,
    222, 210, 199, 122, 235, 72, 13, 156, 38, 145, 137, 78, 65, 176, 94, 163, 95, 59, 92, 114, 243, 204, 224, 43, 185, 168, 244, 203, 28, 124, 248, 105, 10, 87, 115, 161, 138, 223, 108, 192, 6, 186, 101, 16, 39, 134, 123, 200, 190, 195, 178,
    164, 9, 251, 245, 73, 162, 71, 7, 239, 62, 69, 209, 159, 3, 45, 247, 19, 174, 149, 61, 57, 146, 234, 189, 15, 202, 89, 111, 207, 31, 127, 215, 198, 231, 4, 181, 154, 64, 125, 24, 93, 152, 37, 116, 160, 113, 169, 255, 44, 36, 70, 225, 79,
    250, 12, 229, 230, 76, 167, 118, 232, 142, 212, 98, 82, 252, 130, 23, 29, 236, 86, 240, 32, 90, 67, 126, 8, 133, 85, 20, 63, 47, 150, 135, 100, 103, 173, 184, 48, 143, 42, 54, 129, 242, 18, 187, 106, 254, 53, 120, 205, 155, 216, 219, 172,
    21, 253, 5, 221, 40, 27, 2, 179, 74, 17, 55, 183, 56, 50, 110, 201, 109, 249, 128, 112, 75, 220, 214, 140, 246, 213, 136, 148, 97, 35, 241, 60, 188, 180, 206, 80, 91, 96, 157, 81, 171, 141, 131, 158, 1, 208, 26, 41
};

void InsertStaticStrTable(char* csString, unsigned char cLen, void* pData, StaticStrTable* pTable) {
    unsigned char cHash = cLen - csString[0];

    // Pearson-style mixing loop; the cast keeps the table index in 0..255 even if char is signed
    for (int i = 0; i < cLen; ++i) cHash ^= cPerm256[cHash ^ (unsigned char)csString[i]];

    unsigned short cTableIndex = cHash % pTable->nBuckets;
    long long* pBucket = pTable->pBuckets[cTableIndex];

    // Inserts the data and records how many collisions there are - it may look weird because the way
    // I decided to pack the data into the table buffer is very compact and arbitrary.
    // It won't affect the hash though, which is the key issue!

    for (int i = 0; i < pTable->nMaxCollisions; ++i) {
        if (i == 1) {
            pTable->nRepeats++;
        }

        long long* pSlotID = pBucket + (i << 1);

        if (pSlotID[0] == 0) {
            // The string and data pointers are packed into the bucket buffer as 64-bit integers,
            // hence the explicit casts.
            pSlotID[0] = (long long)csString;
            pSlotID[1] = (long long)pData;

            break;
        }
    }
}
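
For reference, here is a minimal sketch of the kind of test driver described above (not the original test code): it generates random printable-ASCII filenames with rand(), applies the same hash-and-modulo step, and counts how many buckets stay empty. The helper name PearsonHash, the NUM_BUCKETS/MIN_LEN/MAX_LEN constants, and the fixed seed are made up for illustration, and it assumes the cPerm256 table above is in scope (e.g. compiled in the same file).

#include <stdio.h>
#include <stdlib.h>

#define NUM_BUCKETS 50
#define NUM_STRINGS 50
#define MIN_LEN 4
#define MAX_LEN 16

// Same hashing step as InsertStaticStrTable, factored out for testing.
static unsigned char PearsonHash(const char* s, unsigned char len) {
    unsigned char hash = len - (unsigned char)s[0];
    for (int i = 0; i < len; ++i) hash ^= cPerm256[hash ^ (unsigned char)s[i]];
    return hash;
}

int main(void) {
    int counts[NUM_BUCKETS] = {0};
    char name[MAX_LEN + 1];

    srand(12345);  // fixed seed so a run is reproducible

    for (int n = 0; n < NUM_STRINGS; ++n) {
        int len = MIN_LEN + rand() % (MAX_LEN - MIN_LEN + 1);
        for (int i = 0; i < len; ++i) name[i] = (char)(33 + rand() % 94);  // printable ASCII
        name[len] = '\0';

        counts[PearsonHash(name, (unsigned char)len) % NUM_BUCKETS]++;
    }

    int empty = 0;
    for (int b = 0; b < NUM_BUCKETS; ++b) if (counts[b] == 0) ++empty;
    printf("%d / %d buckets empty\n", empty, NUM_BUCKETS);
    return 0;
}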
FShrike
    An 8-bit hash is basically junk. Why not use one that provides a wider range of output values? You'd want one that's 2^32 or better these days. – tadman Jan 20 '21 at 19:16
  • @tadman because I don’t need or want the wasted space of having a table with >256 buckets – FShrike Jan 20 '21 at 19:16
  • It says on the Wikipedia page it's "designed for 8-bit registers" which is, specifically, things like the 6502 or 8088, chips that have been replaced by 16, 32 and 64-bit successors almost 40 years ago. This is not something relevant to a modern CPU – tadman Jan 20 '21 at 19:17
  • You can always modulo vs. a relatively prime number and you're fine. Larger hashes provide more uniformity due to the Pigeonhole Principle etc. – tadman Jan 20 '21 at 19:18
  • How many entries were inserted to the hash table when 18/50 ratio was observed? – tstanisl Jan 20 '21 at 19:18
  • I entered 50; note that this terrible collision ratio is observed with both small and large table sizes and entry counts - even when the table size dwarfs the number of entries, the distribution is poor – FShrike Jan 20 '21 at 19:19
  • As a test try with a more uniform hash like SHA2-256 even if it's super inefficient. These hashes are better understood than some anachronism like a Pearson Hash. – tadman Jan 20 '21 at 19:19
  • @tadman Why does the fact that my computer is capable of 64-bit operations invalidate 8-bit ones? If it worked back then it would still work now, and faster too. I won't ever need >256 buckets or data entries for what I'm going to use this for, so the "uniform" 8-bit distribution of a pearson hash should still suffice – FShrike Jan 20 '21 at 19:22
  • Are you saying my implementation is fine, and the algorithm is just not as good as claimed, even for small sizes? – FShrike Jan 20 '21 at 19:22
  • Do the file names share the same suffix e.g. '.jpg'? – tstanisl Jan 20 '21 at 19:27
  • @tstanisl No - they are entirely randomised, no extensions - I'm testing its thoroughness empirically before I actually use it for my intended use case – FShrike Jan 20 '21 at 19:28
  • A lot of these hashes for 8-bit machines were extremely limited but "good enough" considering the alternatives. Just use a modern hash and stop stressing about this. Lots of [examples here](https://stackoverflow.com/questions/7666509/hash-function-for-string). Many of these take advantage of how modern CPUs can execute more than one op per cycle while your old 6502 struggled to do even one op per cycle. – tadman Jan 20 '21 at 19:30
  • @tadman OK, thank you. I thought 256 values would be more than enough variation. How would I use a set of 2^32 values to provide useful variation across my very limited use space - would (32-bit hash) mod (20 entries or thereabouts) still give good statistical properties? – FShrike Jan 20 '21 at 19:44
  • Again, **do not use non-prime numbers**. Anything with common factors will cause collisions. You want to use primes whenever possible, as they have no factors other than 1 and themselves. Using 20 is awful because it has factors 5, 4, and 2. 60 is even worse, since that includes 3 as well. Maybe the real problem here is using bad non-prime bucket sizes with lots of factors. See [answers like this](https://stackoverflow.com/questions/3980117/hash-table-why-size-should-be-prime) or [this](https://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus) for more notes. (A sketch of a wider hash reduced modulo a prime bucket count appears after this comment thread.) – tadman Jan 20 '21 at 19:49
  • If you distribute 50 items randomly over 50 bins, the *approximate* expected number of empty bins is 17 (plus 16 bins with 1 elem, 14 with 2 elems, 6 bins with 3 elems) [note: random is a good approximation for an optimal hash function] – wildplasser Jan 20 '21 at 19:54
  • I did a little testing. When the hash is applied to 50 random strings, the full range (0-255) hash result has only a few collisions, typically 3 or 4 (worst was 9 in my limited testing). That seems consistent with the birthday paradox. When you apply the modulo (`% 50`), the collision rate increases to about 15 to 19 (the highest I saw was 26). The modulo reduces the number of bins by a factor of five, so not surprisingly, the collision rate goes up. – user3386109 Jan 20 '21 at 20:19
  • @tadman: Re “Larger hashes provide more uniformity due to the Pigeonhole Principle etc”: You may be confusing uniformity with uniqueness or some other property. Uniformity is about a “level” distribution, so that the preimages of potential outputs are approximately the same size (the number of inputs that map to some output x are about equal to the number of inputs that map to some output y). The Pigeonhole Principle is about uniqueness; outputs cannot be unique if there are more input possibilities than output possibilities. – Eric Postpischil Jan 20 '21 at 20:40
  • @EricPostpischil I mean this in a broader sense vis-a-vis collision-free hashes which are often a design goal. In this case collisions are inevitable, but that's often far less desirable due to how it escalates insertion and retrieval times. – tadman Jan 20 '21 at 20:48
  • @wildplasser How can that be? Surely if each bin has an equal chance, each would receive approximately the same number of items - a uniform distribution, not a bell curve or anything like that - or am I mathematically uneducated? – FShrike Jan 20 '21 at 20:50
  • @wildplasser What was your calculation? – FShrike Jan 20 '21 at 20:50
  • I just ran a simulation. Figures are approximate. Pull 50 random numbers, take them modulo 50, and increment the corresponding bin-count. Afterwards, make a histogram of the counts inside the bins. – wildplasser Jan 20 '21 at 20:56
  • @IdioticShrike Read about the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem). Basically, if you randomly choose 23 numbers in the range 1 to 365, there's a 50% chance that two of those numbers will be the same. You are choosing 50 random numbers from 50 possible values. So the number of collisions is expected to be high, and each collision reduces the number of bins used by 1. – user3386109 Jan 20 '21 at 20:59
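
Not part of the original thread - a minimal sketch of the approach tadman suggests above: compute a wider hash (standard 32-bit FNV-1a, shown here) and reduce it modulo a prime bucket count (53 is a stand-in near the 50 used in the tests). The function names are made up for illustration.

#include <stdint.h>
#include <stddef.h>

// Standard 32-bit FNV-1a: offset basis 2166136261, prime 16777619.
static uint32_t Fnv1a32(const char* s, size_t len) {
    uint32_t hash = 2166136261u;
    for (size_t i = 0; i < len; ++i) {
        hash ^= (unsigned char)s[i];
        hash *= 16777619u;
    }
    return hash;
}

// Reduce the wide hash to a bucket index; per the comments above, a prime
// bucket count such as 53 avoids sharing factors with patterns in the input.
static unsigned BucketIndex(const char* s, size_t len, unsigned nBuckets) {
    return (unsigned)(Fnv1a32(s, len) % nBuckets);
}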

1 Answer


FYI (this is not an answer; I just need the formatting): these are just single runs from a simulation, YMMV.

distributing 50 elements randomly over 50 bins:


kalender_size=50 nperson = 50
E/cell| Ncell | frac   |  Nelem   |  frac  |h/cell|  hops  | Cumhops
----+---------+--------+----------+--------+------+--------+--------
  0:       18 (0.360000)        0 (0.000000)     0        0        0
  1:       18 (0.360000)       18 (0.360000)     1       18       18
  2:       10 (0.200000)       20 (0.400000)     3       30       48
  3:        4 (0.080000)       12 (0.240000)     6       24       72
----+---------+--------+----------+--------+------+--------+--------
  4:       50                  50                1.440000         72

Similarly: distribute 365 persons over a birthday-calendar (ignoring leap days ...):


kalender_size=356 nperson = 356
E/cell| Ncell | frac   |  Nelem   |  frac  |h/cell|  hops  | Cumhops
----+---------+--------+----------+--------+------+--------+--------
  0:      129 (0.362360)        0 (0.000000)     0        0        0
  1:      132 (0.370787)      132 (0.370787)     1      132      132
  2:       69 (0.193820)      138 (0.387640)     3      207      339
  3:       19 (0.053371)       57 (0.160112)     6      114      453
  4:        6 (0.016854)       24 (0.067416)    10       60      513
  5:        1 (0.002809)        5 (0.014045)    15       15      528
----+---------+--------+----------+--------+------+--------+--------
  6:      356                 356                1.483146        528

For N items over N slots, the expected number of empty slots equals the expected number of slots holding exactly one item. The expected fraction is 1/e for both.
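
As a quick sanity check (my own working, not from the original answer): with n bins and n items, the probability that a given bin stays empty is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so for n = 50 the expected number of empty bins is about 0.368 × 50 ≈ 18.4, matching the 18 empty cells in the first table. More generally each bin's occupancy is approximately Poisson(1), so the expected fraction of bins holding exactly k items is e^(-1)/k!, i.e. about 0.368, 0.368, 0.184, 0.061 for k = 0..3 - close to the fractions in both tables.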

The final number (1.483146) is the number of ->next pointer traversals per found element (when using a chained hash table). Any optimal hash function will get close to 1.5.
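
For reference, a minimal sketch of this kind of simulation (not the program that produced the tables above), assuming rand() is uniform enough for the purpose; N and the seed are arbitrary:

#include <stdio.h>
#include <stdlib.h>

#define N 50  // number of bins and number of items

int main(void) {
    int bins[N] = {0};
    int histogram[N + 1] = {0};  // histogram[k] = number of bins holding exactly k items

    srand(1);
    for (int i = 0; i < N; ++i) bins[rand() % N]++;   // drop N items into N random bins

    for (int b = 0; b < N; ++b) histogram[bins[b]]++; // tally bin occupancies

    for (int k = 0; k <= N; ++k) {
        if (histogram[k] == 0) continue;
        printf("%2d items: %3d bins (%.6f)\n", k, histogram[k], (double)histogram[k] / N);
    }
    return 0;
}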

wildplasser
  • Thank you for taking the time to make this. I will study it! I am bamboozled by how 129:365 days are empty but I’ll make sense of it eventually. Many thanks – FShrike Jan 20 '21 at 21:59