3

I have a huge number of users so I am needed to shard the databases in n shards. So to proceed with this I have below options-

  1. Divide my data in n shards basis userId modulus n operation. i.e. if I have 10 shards userId 1999 will be sent to 1999%10=9th shard
    Problem- The problem with this approach is if the number of shard increases in future reference to previous will not be maintained.

  2. I can maintain a table with UserId and ShardId
    Problem- If my users increase in future to billions I'll need this mapping table to be shared which doesn't seem to be good solution.

  3. I can maintain static mapping in code like 0-10000 in Shard 1 and more on.
    Problem-

    • With the increase in shards and Users Code needed to be changed more often.
    • If any specific User in shard has huge data It'd get difficult to separate out the shard.

So, these are the three ways I could have found but all having some problem. What would be an alternate or better approach to shard the MySQL tables which can compensate with increased number of shards and users in future.

Akash Kumar
  • 642
  • 7
  • 20

1 Answers1

3

I prefer a hybrid of 1 and 2:

  1. Hash the UserId into, say, 4096 values.
  2. Look up that number in a 'dictionary' that has shard numbers in it.

If a shard gets too full, migrate all the users with some hash number to another shard.

If you add a shard, migrate a few hash numbers to it - preferable from busy shards.

This forces you to write a script for moving users, and make it robust. Once you have that, a lot of other admin tasks become 'simple':

  • Retire a machine
  • Upgrade the OS (one by one across shards)
  • Upgrade whatever software is on the machines
  • Migrate a hash number that is bulky but not busy to a old, slow, shard that has a big disk. Similarly migrate small and busy to a shard with more cores and faster disks.

Each shard could be an HA cluster (Galera, Group replication, etc) of servers for both reliability and read-scaling. (Sharding gives you write-scaling.

There would need to be a way to distribute the dictionary to all clients "promptly".

All of this works well if you have, say, each hash in 3 different shards for HA. Each of the 3 would be at geographic locations for robustness. The dictionary would have 4 columns to say where the copies are. The 4th would be used during migrations.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • migration is a heavy operation, In production environment is it preferable? – Akash Kumar May 15 '19 at 07:34
  • 1
    @AkashKumar - Preferable to what? If one shard's disk is full, you have to do something, and that something will be a 'heavy operation'. I suggest that having a script to move one small group of users satisfies that need, plus many other needs. And you can focus on ways to make it less "heavy", such as with the 4th column. – Rick James May 15 '19 at 16:10