3

I've been reading this article because it was linked by Jon Skeet on this answer. I'm trying to really get an understanding of how hashing works and why Jon likes the algorithm he provided so much. I'm not claiming to have an answer to that yet, but I do have a specific question about the base System.String implementation of GetHashCode.

Consider the code, focusing on the annotated <<<<<========== line:

public override unsafe int GetHashCode()
{
  if (HashHelpers.s_UseRandomizedStringHashing)
    return string.InternalMarvin32HashString(this, this.Length, 0L);
  fixed (char* chPtr = this)
  {
    int num1 = 352654597;
    int num2 = num1;
    int* numPtr = (int*) chPtr;
    int length = this.Length;
    while (length > 2)
    {
      num1 = (num1 << 5) + num1 + (num1 >> 27) ^ *numPtr;
      num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPtr[1];
      numPtr += 2;
      length -= 4;   <<<<<==========
    }
    if (length > 0)
      num1 = (num1 << 5) + num1 + (num1 >> 27) ^ *numPtr;
    return num1 + num2 * 1566083941;
  }
}

Why do they only process every fourth character? And, if you're willing enough, why do they process it from right to left?

Community
  • 1
  • 1
Mike Perrenoud
  • 66,820
  • 29
  • 157
  • 232
  • 3
    Loop unrolling. I explicitly covered this detail in [this answer](http://stackoverflow.com/a/15176541/17034). – Hans Passant Dec 04 '13 at 18:29
  • As a sidenote: in his book C# In depth, third edition he simply multiplies the hashcode of the first item by 37 and add the hashcode of the second item (in a class with 2 fields). He refers to Effective Java 2 for more information though, which I sadly don't have. – Jeroen Vannevel Dec 04 '13 at 18:29
  • @HansPassant, I'd have to say that the first line, `This is possibly more than you bargained for ...`, is true from the code's perspective. However, you're answer is absurdly phenomenal! It really has helped me better understand this operation, and I apologize for duplicating it a little. – Mike Perrenoud Dec 04 '13 at 18:43

3 Answers3

6

Why do they only process every fourth character? And, if you're willing enough, why do they process it from right to left?

They're not doing either. They're processing the characters as pairs of integer values (note that they use *numPtr and numPtr[1] in the while loop). Two Int32 values takes the same space as 4 characters, which is why they're subtracting 4 from the length each time.

This is processed from front to back (in array order), but length is decremented since it's representing the length of the string remaining to process. This means they're processing from left to right in "blocks of 4 characters" at a time while possible.

Reed Copsey
  • 554,122
  • 78
  • 1,158
  • 1,373
  • I'm trying to get a good grasp on this, and I [found this post](http://www.readability.com/articles/zzr4pe1b). Am I correct that this is basically what they are doing here? I know the post is in `C`, but am I headed down the right path to get an even better understanding? – Mike Perrenoud Dec 04 '13 at 19:46
  • @MichaelPerrenoud Yep - pointer math is not common in C# (requires `unsafe`), but you can do it when necessary. – Reed Copsey Dec 04 '13 at 20:15
  • Thanks a lot! I can see I have a real deficiency here so I'm going to do some reading and build some examples. I may have some follow up questions as I start building example that you may see. – Mike Perrenoud Dec 04 '13 at 20:18
  • 1
    @MichaelPerrenoud It's not really a deficiency if you're a managed developer - it's very, very rare that using unsafe and manipulating pointers is a good idea ;) – Reed Copsey Dec 04 '13 at 20:22
  • LOL, that's a very good point! I guess I just have room to grow in these lower level parts of computer science! – Mike Perrenoud Dec 04 '13 at 20:24
4

It does not process every fourth character. That's because of this line:

int* numPtr = (int*) chPtr;

It changes the pointer type to int*, what makes it process two chars every time numPtr is used. And because it's used twice every loop iteration:

num1 = (num1 << 5) + num1 + (num1 >> 27) ^ *numPtr;
num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPtr[1];

it takes 4 characters every time.

MarcinJuraszek
  • 124,003
  • 15
  • 196
  • 263
3

numPtr is a pointer to a 32-bit integer.
Each iteration of the loop processes two 32-bit integers (*numPtr and numPtr[1]).

Therefore, it ends with numPtr += 2 (skip 2 32-bit chunks) and length -= 4 (we just finished 4 16-bit chars).

SLaks
  • 868,454
  • 176
  • 1,908
  • 1,964
  • So this operation, `(int*) chPtr`, turns the entire `char[]` into a 32-bit integer? Or rather, a pointer to a 32-bit integer? And why does it have to be a pointer? Man, it's been a *long time* since I've used pointers (ehem, like 10 years). – Mike Perrenoud Dec 04 '13 at 18:37
  • @MichaelPerrenoud: No. Pointers point to raw memory. Casting an `X*` to a `Y*` tells the compiler to treat that raw memory as a `Y` instead of an `X`. This is not necessarily safe. – SLaks Dec 04 '13 at 18:42
  • It's faster when you use pointers. And without pointers you can't just cast `char[]` to `int[]`. – MarcinJuraszek Dec 04 '13 at 18:43
  • @SLaks or @Marcin, what material could I read to help me better understand pointers? Further, why casting `char*` to `int*` takes two characters? I believe I understand it at a very rudimentary level, but I want a better understanding and don't want to waste your time as you clearly have a very deep understanding. – Mike Perrenoud Dec 04 '13 at 18:45
  • @MichaelPerrenoud: `char` is 2 bytes wide. `int` is 4 bytes wide. Thus, an `int*` refers to the first **4** bytes at that address. – SLaks Dec 04 '13 at 18:52
  • @SLaks, I think I'm getting where you guys are going with this. We aren't addressing values, we're addressing memory locations with the addition and indexing. Is that right? – Mike Perrenoud Dec 04 '13 at 18:53
  • `char[]` is just a peace of memory. Casting it to `char*` gives you pointer to the very first byte in that memory, and because `char` uses 2 bytes, reading from that memory using `char*` reads 2 bytes. But when you cast the pointer to `int*` it still points to the same first byte of data, but because `int` uses 4 bytes you're now reading it by 4 bytes a time. – MarcinJuraszek Dec 04 '13 at 18:53
  • @MarcinJuraszek, that makes perfect sense. I'm so very sorry to have shown great ignorance here, but I do appreciate my peers working with me! Now I understand `numPtr += 2`. It's moving forward 2 bytes in the memory address. In a way it's moving forward in the array, but it's a trick to process 4 `char`'s at a time. – Mike Perrenoud Dec 04 '13 at 18:58
  • @MichaelPerrenoud: It doesn't move 2 bytes; it moves to `int`s (= 8 bytes), because that's what the pointer points to. – SLaks Dec 04 '13 at 18:59
  • @MichaelPerrenoud `numPtr += 2` is moving forward 8 bytes! That's because `numPtr` is `int*`, so it's `2*4bytes`. If it was `char*` it would move forward `2*2bytes=4bytes`. – MarcinJuraszek Dec 04 '13 at 18:59
  • @SLaks, thank you. Yes, 8 bytes because it's an `int*`. Ugh, I'm very sorry guys. – Mike Perrenoud Dec 04 '13 at 18:59