0

How do HTTP caches store their requests ? Is there a commonly used protocol for caching requests or does each implementation have its own method for caching ?

EDIT : By this I mean how do the servers PHYSICALLY store cached requests, once the decision to cache has already been made.

I was looking through the functionality of some HTTP cache implementations such as polipo and found that they store (at least) part of their cache in the local file system but later found that nginx caches files/ file content (meaning there's a more efficient method for accessing cashed requests than storing them in the file system).

I was playing around with possible ideas and I tried to implement this method:

Hash request message -> store in a AVL -> access later using the hash value

This way it's simpler and reasonably effecient to search through the AVL to see whether the request has been serviced before. The AVL tree node has a pointer to the requests contents, this way they remain in main memory.

And I used this as the hash function:

static int hash( int size, request_t* bst_l) {

    unsigned long int hashval;
    int i = 0;

    // Convert our string to an integer
    while( hashval < ULONG_MAX && i < strlen( bst_l->MSG ) ) {
        hashval = hashval << 8;
        hashval += bst_l->MSG[ i ];
        i++;
    }

    return hashval % size;
}


where size is the size of the AVL tree.

From this I expected a unique hash value for every unique message . Though I keep getting similar hash values for different requests. Is this because of the (hashval % size) line ?

Is the above method a good one in terms of scalability and efficiency ? and if so does the hash function match it properly ? Or is there a more common method for hashing requests ?

  • 1
    Your hash function will always go the the full length of the request. `(hashval < ULONG_MAX)` will *always* be true unless `hashval == ULONG_MAX`. The only way it wouldn't is if you had 4 consecutive `0xff` bytes in the request string. So in most cases your hash code is just the last 4 characters of the request string. – Jim Mischel Jun 27 '19 at 00:14
  • Thanks for this. The hash val does represent the last couple of bytes of the string. I should instead hash the request line only. – Nelson Mongare Jun 27 '19 at 18:55
  • Regardless of what you hash, you should use a good [hash function](https://en.wikipedia.org/wiki/Hash_function).. The [Jenkins hash](https://en.wikipedia.org/wiki/Jenkins_hash_function), for example, is pretty good. – Jim Mischel Jun 28 '19 at 01:03

2 Answers2

2

To answer your questions:

How do HTTP caches store their requests ?

This is totally up to the client. Make sure you honor the cache headers. See this article for more information: https://www.keycdn.com/blog/http-cache-headers

Is this because of the (hashval % size) line ?

Well, yes, it only gives you size possibilities.

Is the above method a good one in terms of scalability and efficiency ? and if so does the hash function match it properly ? Or is there a more common method for hashing requests ?

No, it doesn't seem to work as you state. See this answer for a proper implementation:

https://stackoverflow.com/a/7666577/2416958


From comments:

Serverside:

It's up to the server. That's often done in various way's as well; A lot of them use a hash, and a memory storage. But that's not typical http related; it's a server implementation. Can be reddis for example.

The hash (server) is usually generated based on either; the calling url, or the domain in which it's relevant. Can be a custom string, hashed for fast access.


As for the "most effective way"; it depends. I know, its a boring answer. As for speed; an optimized structure in memory would be the fastest way to stream the data to the client. But it's often takes the largest amount of memory. So there are always a couple of things to consider.

Stefan
  • 17,448
  • 11
  • 60
  • 79
  • I might have improperly worded my question. I meant how do they physically store cached requests. – Nelson Mongare Jun 26 '19 at 18:23
  • Still, that depends on the client. Some put it on disk, some only in memory. Some structured, some not. Or do I still misunderstand? – Stefan Jun 26 '19 at 18:25
  • Naturally, the decision of whether or not to cache is affected by the headers in the request message. My question was in the event that you do decide to cache, what would be the most effective way to store the requests ? – Nelson Mongare Jun 26 '19 at 18:34
  • If you mean servers; than it's up to the server. That's often done in various way's as well; A lot of them use a hash, and a memory storage. But that's not typical http related; it's a server implementation. Can be [reddis](https://redislabs.com/solutions/use-cases/redis-for-caching/) for example. – Stefan Jun 26 '19 at 18:40
  • The hash (server) is usually generated based on either; the calling url, or the domain in which it's relevant. Can be a custom string, hashed for fast access. – Stefan Jun 26 '19 at 18:42
1

Is this because of the (hashval % size) line ?

No, of course the modulo division increases the possibility of collisions but even without using it you can get duplicate cases, a perfect hash is quite difficult to achieve, not to say impossible when the samples are random. I suggest you to find a hashmap implementation managing collisions (where every node in the hash table stores a link to the next collisioned key which you have to compare with your string)

David Ranieri
  • 39,972
  • 7
  • 52
  • 94