
I want to partition my users into several groups to run an A/B test.

The usual approach is to randomly assign each user to a variant and store that assignment until the end of the A/B test. But that forces me to keep the association somewhere, and I want to avoid that.

Since the users are already registered in my application, I would like a function that distributes them uniformly across the variants so that I get unskewed results in my A/B test.

Which kind of hash function should I use?

barracel
  • Multiply by a large odd number and take the product modulo 2 ? – wildplasser Nov 22 '12 at 23:32
  • @wildplasser that was one of my first thoughts but I wasn't confident enough to use it. Can you point me to some source that shows me that it will be as good as the traditional pseudo-random choice? – barracel Nov 22 '12 at 23:49
  • Well: test it! It is a Bernoulli process, so the distribution between 0<-->1 (a<-->b) can be a bit different from 50/50. BTW: You don't need to multiply: the odd numbers stay odd and the even ones will stay even... (but you could instead test the Ith-bit of the product) – wildplasser Nov 22 '12 at 23:53
  • @wildplasser even if I verify that it is evenly distributed (50/50), how can I be sure that there are no hidden correlations (that I'm not aware of) between the users in each variation? For example, the oldest users (smaller ids) could be grouped in one of the variations and be more sensitive to the change... – barracel Nov 23 '12 at 00:00
  • ... or the people with blue eyes could be overrepresented in one of the groups, or younger people, or women, or asians, or even germans. Run some summary statistics over it, and you'll see. – wildplasser Nov 23 '12 at 00:04
  • I wouldn't know how to do it... That's why I was hoping someone more capable than me has already done it. – barracel Nov 23 '12 at 00:15
  • Well you could always handle them one at a time and flip a coin for each of them. – wildplasser Nov 23 '12 at 00:16
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/19966/discussion-between-barracel-and-wildplasser) – barracel Nov 23 '12 at 00:17
  • If you plan to analyze based on some characteristic (for example, old vs new), I think it's safest to also assign conditions based on that characteristic. But if all you care about is age-of-account and you've been sequentially assigning user ids, you might be able to do it with user ids. I would take new or recent data and run "A/A tests" with no intervention to get a sense of the variance and see if your randomization works. If you get more than the expected level of chance findings, something is wrong. – MattBagg Nov 24 '12 at 01:10
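
The checks suggested in the comments above (test the split, and look for hidden grouping by account age) can be sketched roughly as follows. This is only an illustration, not a full A/A test: the function names `chi_square_balance` and `balance_by_segment` are hypothetical, and it assumes that user ids are sequential integers, so that contiguous id ranges stand in for "old" versus "new" accounts.

```python
from collections import Counter


def chi_square_balance(assign, user_ids, n_variants):
    """Chi-square statistic of the variant counts against an even split.

    `assign` maps a user id to a variant index in range(n_variants).
    """
    counts = Counter(assign(uid) for uid in user_ids)
    expected = len(user_ids) / n_variants
    return sum(
        (counts.get(v, 0) - expected) ** 2 / expected
        for v in range(n_variants)
    )


def balance_by_segment(assign, user_ids, n_variants, n_segments=4):
    """Run the same check inside contiguous id ranges (oldest to newest).

    Ids left over when the total is not divisible by n_segments are ignored.
    """
    ordered = sorted(user_ids)
    size = len(ordered) // n_segments
    return [
        chi_square_balance(assign, ordered[i * size:(i + 1) * size], n_variants)
        for i in range(n_segments)
    ]
```

For example, with `ids = list(range(1000))`, an assignment that puts ids below 500 in variant 0 and the rest in variant 1 looks perfectly balanced overall (statistic 0), but every quarter of the id range scores 250. With two variants, a statistic above roughly 3.84 is significant at the 5% level, so this exposes the grouping by account age.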

1 Answer


This ACM paper explains that MD5 is a good hash function for getting both a uniform distribution and no correlations between experiments:

We found that only the cryptographic hash function MD5 generated no correlations between experiments. SHA256 (another cryptographic hash) came close, requiring a five-way interaction to produce a correlation. The .NET string hashing function failed to pass even a two-way interaction test.
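
As a minimal sketch of how this could be applied (not the paper's exact procedure): hash the user id together with an experiment name using MD5 and reduce the result modulo the number of variants. The function name `assign_variant`, the `experiment` salt, and the `"experiment:user_id"` key format are assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant; nothing needs to be stored."""
    # Assumption: prefixing the experiment name gives each experiment
    # its own, independent split of the same user base.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Read the first 8 bytes as an unsigned integer, then pick a bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Calling `assign_variant(42, "new-checkout")` always returns the same variant for that user and experiment, so the assignment can be recomputed on every request instead of being looked up.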

barracel