2

I have thought a bit about making a somewhat lightweight consistent-hashing-like PHP function to shard uploaded files between different servers.

Obviously, rand() would work to distribute the files among the servers somewhat evenly, but when requesting the files, no one will know which file lies on what server...

I know that there's some extensive libraries out there to create consistent-hashing, but I wonder how these works and how I can do to roll out my own, very lightweight one?

Note: I do not take into account that servers will be removed, but instead more ones added to the pool further on.

Update:

Here's a quick line of psuedocode:

$config['shards'] = array('192.168.1.1, 192.168.1.2');

function shard ($filename) {

    $servers = $config['shards'];

    // do lookup in some magic way to decide which server to return.

    return $appropriateserver;
}


echo shard('filename.jpg'); // returns the appropriate server to distribute the file.
Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
Industrial
  • 41,400
  • 69
  • 194
  • 289

3 Answers3

2

Well, one thing you could do would be to use crc32...

$crc = crc32($mykey);
$serverNo = $crc % count($servers);

It should be fairly consistent (meaning evenly balanced), and 100% reproducible...

ircmaxell
  • 163,128
  • 34
  • 264
  • 314
  • Hi ircmaxell. Thanks for your help! However there's going to be some issue if I add a server to the pool and thereby the count is changed – Industrial Jul 30 '10 at 10:23
  • 1
    As far as I can see, short of keeping a map there is no way to do that. One possibility would be to re-distribute the files when you add a server, but that's non-trivial as well... – ircmaxell Jul 30 '10 at 10:58
  • 1
    Actually, I thought of another way as long as you don't have a huge volume of files. Store all the files on each server. Then, just pick the server you go to based on the crc... That way you're still sharding (because at any point in time, all requests for a particular file will go through a single server), but you're also saving the problem of adding or removing a server with breaking everything... Of course this depends upon your use case, but for some it may work... – ircmaxell Jul 30 '10 at 12:49
  • Hi again! Thanks a lot for your time and valuable thoughts, ircmaxell! That's definitely one idea, but it would take the total amount of data * the number of servers to store, which would be great unless as you say there's huge amounts of data. Reads would be distributed that way at least... – Industrial Jul 30 '10 at 13:45
  • 1
    It really depends upon your use case, and your exact scenario. If you have very high traffic with low data requirements, this could be very effective. If you have high traffic with large data requirements, you may want to look into a distributed filesystem such as HDFS or MongoGrid. If you have low traffic with large data requirments, it may be better to just get a single server with a bunch of 2TB drives to handle all the requests (for now at least). The point is, the solution you come up with will need to be tailored to the needs of the application it's built for... – ircmaxell Jul 30 '10 at 14:08
  • Hi again! Yep, I can understand that. It's like eveything else - a balance depending on what you're building and what you want to accomplish. Thanks a lot for your help. I have to put in a lot of thought into this before I decide something. – Industrial Jul 31 '10 at 10:45
  • 1
    Be careful with this, on 32 bit platforms many crc32 checksums will [result in negative integers](http://php.net/manual/en/function.crc32.php). – Brian Jordan May 21 '13 at 19:37
1

I recommend using MurmurHash3: it is much faster than cryptographic hash functions, while preserves similar randomness. MurmurHash speed is close to CRC32 or even better. There is PHP implementation.

Community
  • 1
  • 1
0

an eventual solution would be:

CRC32(key) % 4 when you only have 4 servers

and when you want to rebalance you can use 2 different hash functions while migration

ex:

$server_hash1 = crc32($key) % 4
$result = $db->search($server_hash1, $key);

if ($result == false)
{
    $server_hash2 = crc32($key) % 8
    $result = $db->search($server_hash2, $key);
}
  • You have to do the same for insert/update (with a move function from config1 to config2)
  • You can do the move async (batched way)
nemenems
  • 1,064
  • 2
  • 9
  • 27