
I have around 10 million users (and growing), each with an email and a phone number, both pointing to a user ID. I created two hashes, one for emails and one for phone numbers, like this:

// A single user with both email and phone number pointing to the same user ID
$redis->hSet('email-users', 'abc@xyz.com', 1);
$redis->hSet('phone-users', '+192938384849', 1);

Now, since there are millions of users, these hashes are growing very large, and I also want to search through them; for example, I want to get the user ID for a given email from the email-users hash.

From what I have read, Redis hashes are most efficient when they stay small enough to use the ziplist encoding, so the recommended way to store a large map (dictionary) is to divide it into smaller buckets of a fixed size, say at most 10,000 keys per hash.

So, if I divide my 10 million users into buckets of 10,000 keys, there would be around 1,000 hashes for emails and another 1,000 for phone numbers.
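
Just to illustrate what I have in mind (a rough sketch; picking the right bucket for a given key is exactly the part I am unsure about):

// Instead of one huge hash per type, ~1000 smaller ones:
// email-users-1 ... email-users-1000 and phone-users-1 ... phone-users-1000
$bucket = /* ??? somehow derived from the email or phone number */ 1;
$redis->hSet("email-users-$bucket", 'abc@xyz.com', 1);
$redis->hSet("phone-users-$bucket", '+192938384849', 1);

// Later, to get the user ID back, I would again need to know the right bucket:
$userId = $redis->hGet("email-users-$bucket", 'abc@xyz.com');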

My question is: should I divide my users into these 1,000 buckets? And if yes, how can I search through these 1,000 buckets? Or is there a better alternative?

P.S. I am using PHP, and fetching all 1,000 hashes and looping through them would be quite resource intensive. I am also afraid that a wrong approach would kill the performance that Redis is supposed to deliver.

Just as a side note, I think we could use an algorithm like libketama's consistent hashing to place keys on random servers.

Also, if it is hard to work with alphabetic keys, we could first convert each email to numbers, like a=1, b=2, c=3 ... z=26, with a 0 (zero) appended to keep it unambiguous, and a + for the @ and . characters. For example:

abcd@gmail.com  ->  10203040+901301090+3015013

So now we have numbers, which makes it easier to apply any calculation.
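
Just to sketch the conversion I mean (illustrative only):

// Sketch of the letter-to-number conversion described above.
function encodeEmail(string $email): string {
    $out = '';
    foreach (str_split(strtolower($email)) as $ch) {
        if ($ch >= 'a' && $ch <= 'z') {
            // a=1, b=2, ... z=26, each with a 0 appended to keep letters separable
            $out .= (ord($ch) - ord('a') + 1) . '0';
        } elseif ($ch === '@' || $ch === '.') {
            $out .= '+';
        }
    }
    return $out;
}

// encodeEmail('abcd') === '10203040'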

Airy

2 Answers


What you may do is distribute the keys according to the first letter/digit, or the first couple of letters/digits.

You may create your hashes like this: email by first letter, phone number by first one or two digits:

  • email-users-a
  • email-users-b
  • email-users-c
  • phone-users-10
  • phone-users-11

While you do hset/hget, you arrange this at the code level.

Edit:

Let's say we use the first two digits for phone numbers and the first two letters for emails;

then we will have keys like the following:

  • email-users-aa
  • email-users-ab
  • phone-users-11
  • phone-users-12

When we have an email like ersoy@gmail.com, we go to the er email hash group, which is email-users-er, and execute hget email-users-er ersoy@gmail.com.

When we have a phone number like 123456789, we go to the 12 phone hash group, which is phone-users-12, and execute hget phone-users-12 123456789.
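
A minimal sketch of that code-level routing in PHP (assuming phpredis; the helper names are just for illustration, emails are lowercased, and a leading + is stripped from phone numbers before taking the prefix):

// Route each key to a prefix-named hash (sketch only, assumes phpredis).
function emailHashKey(string $email): string {
    // the first two letters of the lowercased email pick the hash
    return 'email-users-' . substr(strtolower($email), 0, 2);
}

function phoneHashKey(string $phone): string {
    // the first two digits (ignoring a leading '+') pick the hash
    return 'phone-users-' . substr(ltrim($phone, '+'), 0, 2);
}

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$redis->hSet(emailHashKey('ersoy@gmail.com'), 'ersoy@gmail.com', 1);   // email-users-er
$userId = $redis->hGet(emailHashKey('ersoy@gmail.com'), 'ersoy@gmail.com');

$redis->hSet(phoneHashKey('123456789'), '123456789', 1);               // phone-users-12
$userId = $redis->hGet(phoneHashKey('123456789'), '123456789');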

Ersoy
  • That seems like a good idea, but it will still create hundreds of hashes, one per letter prefix, and I will still have to loop through all those hundreds of hashes each time. Will that be a good idea? Your technique would definitely decrease the overhead, but it would still be there – Airy Apr 22 '20 at 17:33
  • Actually, at the code level you check the first digit of the phone; let's say it is 1, then you go to phone-users-1. If your email starts with ar, then you go to email-users-ar. So you won't iterate over each hash group. – Ersoy Apr 22 '20 at 18:08
  • Ersoy, even though this will split the load into small hashes sliced by 1 or 2 characters, there could still be problems. For example, what if the hash holding all **A** emails grows past 15,000 email addresses in that single A hash? – Airy Apr 24 '20 at 12:37
  • I didn't get which problems you are referring to. `My question is: should I divide my users into these 1,000 buckets? And if yes, how can I search through these 1,000 buckets? Or is there a better alternative?` Yes, you'd better divide, and the solution I proposed divides/searches dynamically with small code-level changes. – Ersoy Apr 24 '20 at 12:45
  • @Ersoy, I've answered that very question in my answer. You can distribute the keys into buckets based on some hashing algorithm which will always yield the same bucket given the key. This would remove the unnecessary bookkeeping of which key was added to which bucket. – kayvis Apr 24 '20 at 14:45
  • @kayvis you just take the first n letters or digits and you know what the bucket is. Which part of that is bookkeeping? – Ersoy Apr 24 '20 at 14:47
  • @Ersoy, you use the full letters/digits. With the phone/email and the bucket count as input, using any good hashing algorithm, we get the bucket number. My answer has sample code including hashing and retrieving. By bookkeeping, what I meant is that without hashing you'd have to store somewhere which key is in which bucket. That can be avoided by hashing to find the bucket number. – kayvis Apr 24 '20 at 14:52
  • Why should I hash all the inputs to get a bucket name when I already know the bucket name? And no, I am not using hashing and I still know which key is in which bucket: if the email is kayvis@gmail.com then I know it is `hget email-users-ka kayvis@gmail.com` - and it is also size friendly. – Ersoy Apr 24 '20 at 14:56
  • @Ersoy, using your approach, don't you think there may be some buckets which would be 'fuller' than others because of the distribution of letters in the email Ids? I mean some are more probable together than the others: jo(hn) vs xz. IMHO, evenly distributing keys across buckets would be better. – kayvis Apr 24 '20 at 15:00
  • That's a valid case, but it doesn't change the fact that both hget and hset work in O(1). – Ersoy Apr 24 '20 at 15:04
  • @kayvis if it were a sorted set I would absolutely consider distributing them as evenly as possible, since zadd (`O(log(N))`) depends on the size of the set, but with hashes we won't have that problem. – Ersoy Apr 24 '20 at 15:10
  • @Ersoy, while it is true that the time complexity of both hget and hset is O(1), as per the approach mentioned here: https://instagram-engineering.com/storing-hundreds-of-millions-of-simple-key-value-pairs-in-redis-1091ae80f74c, Redis hashes are encoded in memory very efficiently, and a maximum of around 1,000 entries per hash works excellently. Given that we're already encoding 10,000 entries per hash for the given use case, it's better to optimise at every level possible so that the distribution of keys is as uniform as possible. – kayvis Apr 24 '20 at 15:49
  • Yes, I know that blog post; it's posted in an answer in this question's Linked list. Is there any more recent source (none of the links in the post work, since it was written 9 years ago) to support that 1,000 is the excellent number? – Ersoy Apr 24 '20 at 16:03
  • It would be great if you could provide some benchmark results for both hget/hset at 10M size. @kayvis – Ersoy Apr 24 '20 at 16:04
  • @kayvis https://redis.io/topics/memory-optimization Every time a hash exceeds the number of elements or element size specified it will be converted into a real hash table, and the memory saving will be lost. You may ask, why don't you do this implicitly in the normal key space so that I don't have to care? There are two reasons: one is that we tend to make tradeoffs explicit, and this is a `clear tradeoff between many things: CPU, memory, max element size`. it is not a silver bullet, one part of tradeoff. – Ersoy Apr 24 '20 at 16:21
  • @Ersoy, yes, all the more reason to ensure the division of input into buckets, if possible, is 'uniform' to whatever degree achievable (so as not to exceed the specified element size, thereby forcing "the conversion to a real hash table"). I'm certainly not against your approach; the only demerit I see is that the distribution could be skewed by a particular input, for instance, if most users are from one country, say the USA, all of whose phone numbers would start with +1[1-9]* – kayvis Apr 24 '20 at 17:00
  • @kayvis yes, there we go back to `that's a valid case but it doesn't change the fact that both hget and hset work in O(1)` :) - thanks for the comments. – Ersoy Apr 24 '20 at 17:02
  • @Ersoy, Haha, sure. By that argument, why not just hash all the email Ids into a single hash instead of splitting into buckets, as it's still O(1)? ;) The whole point of this exercise is to see if we can optimise in any way. You're welcome, thanks for your comments as well :) Have a nice day! – kayvis Apr 24 '20 at 17:20
  • @kayvis because the OP said they already have two hashes and asked the question that way, I proposed a high-cardinality solution. If I had said "yes, you may do it in two hashes" it wouldn't be an answer. If you are asking why high cardinality: it gives you a chance to get keys in chunks (those starting with certain letters or digits), which you can't achieve with a single hash. Deleting by pattern is also simpler, you can migrate them more easily, etc. – Ersoy Apr 24 '20 at 17:31

My question is: should I divide my users into these 1,000 buckets? And if yes, how can I search through these 1,000 buckets? Or is there a better alternative?

Yes. The approach could work in the following way.

For this example, let's treat both the phone numbers and email Ids as strings.

Let's say you have the following buckets (Redis hashes):

For Email Ids: email_0001, email_0002, ..., email_1000
For Phone Numbers: phone_0001, phone_0002, ..., phone_1000
  1. Given an email Id, determine the bucket (max being 1000) by hashing the email Id. You can use consistent hashing for this purpose. Now add the key and value to the appropriate 'bucket'.

    $ HSET "email_0032" "abc@xyz.com" "UID_987"
    
  2. Repeat step 1 for phone numbers. This removes the need for bookkeeping of which key goes into which bucket: given the same key, the hash function will always produce the same value, thereby returning the same bucket number.

    $ HSET "phone_0091" "+192938384849" "UID_987"
    
  3. To retrieve a value, first find the bucket by hashing the email/phone and then look up the value in the appropriate bucket.

    $ HGET "phone_0091" "+192938384849"
      UID_987
    
import java.nio.charset.Charset;

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

public class Sample {

    // Number of buckets (Redis hashes) to spread the keys across.
    private static final int BUCKET_SIZE = 1000;
    private static final HashFunction hashFunction = Hashing.murmur3_128();
    private static final Charset UTF8 = Charset.forName("UTF-8");

    private Sample() {
    }

    // Maps a key (email/phone) to a bucket number in [0, buckets).
    // The same key always maps to the same bucket.
    public static int pickBucket(String key, int buckets) {
        return com.google.common.hash.Hashing.consistentHash(hashFunction.hashString(key, UTF8).asLong(), buckets);
    }

    private static void getFromRedisHash(String key) {
        int bucket = pickBucket(key, BUCKET_SIZE);
        // Get from Redis based on the bucket number, e.g. HGET email_<bucket> <key>
    }

    public static void main(String[] args) {
        System.out.println(pickBucket("abc@xyz.com", BUCKET_SIZE));
        System.out.println(pickBucket("+192938384849", BUCKET_SIZE));
    }
}

The above example is in Java; I'm assuming PHP would have similar libraries for hashing.
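
For instance, a rough PHP sketch of the same idea might look like this. It is only a sketch: it assumes the phpredis extension, numbers buckets 0000–0999, and uses crc32 modulo the bucket count purely as an illustration of a deterministic key-to-bucket mapping (this simple modulo scheme is not consistent hashing like Guava's consistentHash, but it still always maps the same key to the same bucket for a fixed bucket count):

<?php
// Sketch only: pick a bucket deterministically from the key (assumes phpredis).
const BUCKET_COUNT = 1000;

function pickBucket(string $key): int {
    // Mask to keep the value non-negative, then take it modulo the bucket count;
    // the same key always yields the same bucket number in [0, BUCKET_COUNT).
    return (crc32($key) & 0x7fffffff) % BUCKET_COUNT;
}

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Store: the hash name embeds the bucket number, e.g. email_0032
$emailBucket = sprintf('email_%04d', pickBucket('abc@xyz.com'));
$redis->hSet($emailBucket, 'abc@xyz.com', 'UID_987');

// Retrieve: hash the key again to find the same bucket, then HGET
$userId = $redis->hGet($emailBucket, 'abc@xyz.com');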

kayvis
  • Sorry for being late to reply. I just went through your answer and all of the discussion you had with Ersoy under his answer, and you understood my problem pretty well, especially regarding the limit of 1,000 keys. Let me convert it into PHP and see how well it works. – Airy Apr 26 '20 at 10:56
  • Kayvis! As I don't have enough knowledge of Java, can you please elaborate on what's happening at the line `com.google.common.hash.Hashing.consistentHash` when it is given the bucket size of 1000? I mean, is it limiting the hash to create a maximum of 1,000 buckets, or is it saying that each hash will have a maximum of 1,000 entries per bucket? – Airy Apr 26 '20 at 11:53
  • Please note that, as mentioned in the question, there is a constraint: we can have N buckets, but each bucket can have a maximum of 10,000 entries – Airy Apr 26 '20 at 11:54
  • @Airy, what we're trying to achieve in the line you mentioned is to find the bucket number (in the range 1 to 1000) using the given input (email/phone). Each bucket is a Redis hash named email_0001, email_0002, and so on. Now that you know which bucket the key would have been added to (using pickBucket()), you can look up the value from that bucket directly using "hget <bucket> <key>". Does that answer your question? – kayvis Apr 26 '20 at 18:44
  • @Airy, regarding the other question: yes, the assumption is that each bucket can have a maximum of 10k entries. You have to set that configuration. Details can be found here: https://redis.io/topics/memory-optimization. This is to ensure that hash performance is at its best. Once we cross that limit, the 'special' encoding will be removed and the hash will be converted to normal encoding by Redis. That's stated in the 3rd paragraph of that link. Courtesy of that link: Ersoy. As always, it's recommended you test with and without the configuration to ensure you're getting optimal performance. – kayvis Apr 26 '20 at 19:00
  • Kayvis! I got that line, but sorry if I missed anything: where in your code do we specify that the bucket cap is 10,000 keys? Or how does it know that it has to create a bucket with a maximum of 10,000 entries? – Airy Apr 26 '20 at 19:04
  • @Airy, you don't mention that in the code. You have to set it as configuration. Reference: https://www.peterbe.com/plog/understanding-redis-hash-max-ziplist-entries and https://redis.io/commands/config-get. That means all hashes in your Redis will use the 'special' encoding up to 10k keys. Once a hash overflows, Redis will convert the 'specially encoded' hash to a normal hash. – kayvis Apr 26 '20 at 19:14
  • OK. I think I will have to manually cap each bucket at a limit of 10,000 keys. Or do you have any other, better idea? – Airy Apr 26 '20 at 19:20
  • @Airy, No. You should not manually cap the limit. It will slow your system down if you have to check how many entries you already have before every addition to the hash. It's a configuration that you can set; please refer to my previous comment. Note: the configuration is not a hard limit, it's the limit up to which Redis will encode the hash optimally in memory. Past the limit, Redis will treat the hash like a normal hash with no special encoding. – kayvis Apr 26 '20 at 19:25
  • Nice article you linked. So you suggest that I should not cap the limit and let the hash grow as much as it has to beyond the ziplist config? Please note that not capping the buckets will consume a lot of memory if a particular bucket gets flooded. Would you still suggest not capping any bucket? – Airy Apr 26 '20 at 21:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212577/discussion-between-kayvis-and-airy). – kayvis Apr 26 '20 at 21:13