1

I'm running n split tests on a website. I want to assign an evenly distributed integer user id to one of the n buckets, and deterministically so the same user always gets the same test.

At this point, I can just pick an index in the list of split tests by modding the user id by n. What if I want to weight certain tests?

For example, bucket #1/21 is assigned 90% of the time and the remaining 20 tests are assigned 0.5% of the time.

I feel like I can somehow scale up the size of my list and still use the mod technique to accomplish this, but having potentially huge, temporary lists in memory seems inelegant.

Bluu
  • 5,226
  • 4
  • 29
  • 34

1 Answers1

4

If most buckets have distinct sizes, where size is defined as percentage of ids, then you'll have to represent this in memory somehow. Otherwise, how else are you going to know these percentages?

One solution to use is to have let's say 100 virtual buckets, each representing 1% of the ids. Then associate 90 of the virtual buckets to bucket #1/21. Then you can perform a mod 100 and if it falls in the fist 90 virtual buckets, assign the id to bucket #1. You can get the optimal number of virtual buckets by dividing each bucket's percentage by the GCD of all percentages, which in your example is 0.5 (GCD(90, 0.5)).

From your example, there is only one distinct bucket size though. The best solution really depends on what types of arrangements you could have.

moinudin
  • 134,091
  • 45
  • 190
  • 216
  • To clarify, the optimal number of virtual buckets is the SUM of each bucket's percentage divided by the GCD. I was able to code this in Python naively with a list that grows to the optimal number. I'm wondering if this can be done with less memory, such as by only recording the number ranges where the buckets fall, which would look like the chosen response here http://stackoverflow.com/questions/2140787/select-random-k-elements-from-a-list-whose-elements-have-weights. If you could post pseudocode, that would be great. Otherwise I will post my solution soon. – Bluu Jan 03 '11 at 02:40
  • Also, could you clarify what you mean by one distinct bucket size? What other arrangements are there? – Bluu Jan 03 '11 at 02:41
  • @Bluu Regarding distinct bucket size, your example was 1x90% + 20x0.5%. The 90% bucket is the only one that differs from the rest. – moinudin Jan 03 '11 at 10:38
  • @Bluu You can record the ranges, but then you'll lose O(1) complexity on all operations. – moinudin Jan 03 '11 at 10:42