
I have a movie database that I need to populate with data so the application is easier to test and develop. There are tables for movie ratings and user accounts, and users rate the movies.

I've started writing a script to populate the database with fake, generic data, but I don't know how to randomize the ratings. For each movie I select a random number of users (100, 500, 1000, whatever), and for each of those users I generate a random rating from 1 through 10. But these ratings always produce the same average, around 5, which means the distribution of ratings (1 through 10) is basically the same for every movie. This is not "realistic" at all: every movie with ratings generated like this ends up with the same average and the same flat spread of votes, no matter how many users rated it.

I want movie A to have an average of 7, movie B an average of 5, movie C an average of 8, and so on. But I don't just want the average to differ between movies; I want the shape of the distribution to differ too. I mean, it would be nice to produce ratings like this (for a specific number of users): http://www.imdb.com/title/tt1046173/ratings or this http://www.imdb.com/title/tt0486640/ratings

You know, something random that could produce two different variations like those above. I hit refresh and I get the first graph, I hit refresh and get the second, hit again and get something different or similar, something "random" and "realistic".

I'm also going to display graphs like these in my app, so it would look nice to have different distributions. But I have no idea how to accomplish this randomly with a simple script that generates everything.

How can I solve this? Or is it too much work to be worth it?

Maybe something simpler would do: select a point (between 1 and 10) and then create a normal distribution of ratings with its peak at that point. That would work for me.

rfgamaral
  • I don't quite understand your question... do you want to randomly select a rating chart from the existing movie list? – ajreal Dec 30 '10 at 02:10
  • No, I want to randomize ratings that look similar to the charts above so I can insert them into a database and have some data to work with. – rfgamaral Dec 30 '10 at 02:12
  • echo '9'; // You can't prove its not random –  Dec 30 '10 at 02:26
  • I'm just curious: is there a particular reason why you care what the numbers are? I guess I'd just generate test data, verify that my algorithms are correct, and not really sweat what the numbers "look" like. – arcain Dec 30 '10 at 03:45
  • That's what I'm doing: "generating data". And I'm not just talking about algorithms. I'll have to design the website too, and I want different reference cases so I can design the charts properly. It's not about "what the numbers look like"; it's about having some sort of "realistic" data rather than a uniform distribution, which is not much help. – rfgamaral Dec 30 '10 at 05:31

6 Answers


You want to fix the mean, and probably the variance, and generate random numbers around those.

This should help you get started: Generating random numbers with known mean and variance

Edit: Actually, if you think about it, this can be solved easily: your numbers tend toward 5 because you are drawing uniformly from a scale of 1 to 10 (so the mean is 5.5).

Just take your random numbers, shift them up so they center on 8 (for a uniform 1 to 10 draw, that means adding about 3), and round any number greater than 10 down to 10. You'll get something centered around 8-ish (but skewed, with the clamped votes piling up at 10). Probably good enough for your purposes?
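
A minimal sketch of the shift-and-clamp idea above, in Java to match the other code in this thread. The class and method names are hypothetical, and the shift of 3 is an assumption chosen to move the uniform mean of 5.5 up to roughly 8. Note that the histogram stays flat apart from the pile-up at 10, so this changes the average more than it changes the shape.

```java
import java.util.Random;

final class ShiftedRatings {
    // Draw a uniform rating in 1..10, add the shift, and clamp back into 1..10.
    static int shiftedRating(Random rng, int shift) {
        int base = rng.nextInt(10) + 1; // uniform 1..10
        return Math.max(1, Math.min(10, base + shift));
    }

    // Average of `votes` shifted ratings.
    static double average(Random rng, int shift, int votes) {
        long sum = 0;
        for (int i = 0; i < votes; i++) {
            sum += shiftedRating(rng, shift);
        }
        return (double) sum / votes;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.printf("no shift:   %.2f%n", average(rng, 0, 100_000));
        System.out.printf("shift of 3: %.2f%n", average(rng, 3, 100_000));
    }
}
```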

Kenny Winker
  • I don't think that's it. The numbers tend toward 5 because the randomly generated numbers are uniform: every number is exactly as likely as any other. Shifting them up and rounding numbers larger than 10 down to 10 will give me something slightly different, but the ratings will still each get a similar number of votes. – rfgamaral Dec 30 '10 at 05:25

Keep in mind that standard RNGs (random number generators) give you a very even distribution of values. Given enough 'random' values you will get average results, as you have discovered. For populating your database, I would consider this approach:

Select a random number that will act as the average score for the movie. Then generate a set of random numbers in a band around that average. For example, if you randomly generate a 7, generate random numbers between 5 and 9. Then throw in a couple of values from 1 through 6 and 8 through 10 to give the appearance of outliers.
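
The approach above could be sketched roughly like this. It is only an illustration: the class name, the band of plus or minus 2 around the center, and the `outlierRate` parameter are all assumptions, and outliers here are simply drawn uniformly from the whole 1 to 10 scale.

```java
import java.util.Random;

final class BandedRatings {
    // Build a histogram where counts[r - 1] is the number of votes for rating r.
    // Most votes fall within center +/- 2; a fraction `outlierRate` land anywhere.
    static int[] histogram(Random rng, int votes, int center, double outlierRate) {
        int[] counts = new int[10];
        int low = Math.max(1, center - 2);
        int high = Math.min(10, center + 2);
        for (int i = 0; i < votes; i++) {
            int rating;
            if (rng.nextDouble() < outlierRate) {
                rating = rng.nextInt(10) + 1;                // outlier: anywhere in 1..10
            } else {
                rating = low + rng.nextInt(high - low + 1);  // typical vote: inside the band
            }
            counts[rating - 1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int center = rng.nextInt(10) + 1; // random "average" for this movie
        int[] counts = histogram(rng, 1000, center, 0.05);
        System.out.println("center: " + center);
        for (int r = 1; r <= 10; r++) {
            System.out.println(r + ": " + counts[r - 1]);
        }
    }
}
```

Calling `histogram` once per movie with a fresh random `center` gives each movie a different peak, although the band itself is still flat.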

EDIT:

This answer might be what you're looking for, complete with code in Java.

Even Distribution Example:

Your code is likely similar to the following:

import java.util.Random;

public class EvenDistribution
{
    private static Random random = new Random();

    public static void main(String[] args)
    {
        int maxValue = 20;

        // distribution[v] counts how often the value v (0..19) was drawn
        int[] distribution = new int[maxValue];

        int iterations = 1000;

        for (int i = 0; i < iterations; i++)
        {
            // nextInt(maxValue) is uniform over 0..maxValue-1
            int rand = random.nextInt(maxValue);
            distribution[rand]++;
        }

        // Print the buckets with 1-based labels
        for (int i = 0; i < distribution.length; i++)
        {
            System.out.println((i + 1) + ": " + distribution[i]);
        }
    }
}

This class had the following output:

1: 47
2: 45
3: 59
4: 52
5: 54
6: 52
7: 49
8: 49
9: 49
10: 48
11: 40
12: 43
13: 42
14: 61
15: 43
16: 55
17: 47
18: 55
19: 64
20: 46

The distribution is very even. 19 looks a little high, but overall we can say that this method of RNG produces a predictably flat spread.

Using the Uncommons Maths library mentioned above, I ran similar code with its GaussianGenerator.

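import org.uncommons.maths.random.GaussianGenerator;
import org.uncommons.maths.random.MersenneTwisterRNG;

public class RandomDistribution {
    private static MersenneTwisterRNG random = new MersenneTwisterRNG();
    // Gaussian source with mean 7 and standard deviation 3
    private static GaussianGenerator gen = new GaussianGenerator(7, 3, random);

    public static void main(String[] args)
    {
        int maxValue = 20;

        int[] distribution = new int[maxValue];

        int iterations = 1000;

        for (int i = 0; i < iterations; i++)
        {
            // Truncate to an int; Math.abs folds the rare negative draw back to positive
            int rand = Math.abs(gen.nextValue().intValue());
            // Skip the (very rare) draw that falls outside the array
            if (rand < maxValue)
            {
                distribution[rand]++;
            }
        }

        // Print the buckets with 1-based labels
        for (int i = 0; i < distribution.length; i++)
        {
            System.out.println((i + 1) + ": " + distribution[i]);
        }
    }
}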

It produced the following output:

1: 19
2: 27
3: 41
4: 68
5: 110
6: 111
7: 125
8: 138
9: 125
10: 85
11: 64
12: 32
13: 32
14: 14
15: 5
16: 2
17: 1
18: 0
19: 1
20: 0

Seems like this library would be very good for what you are trying to accomplish.
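
If pulling in a library is not an option, a similar shape can be sketched with the JDK alone. This is a minimal sketch, not the library's implementation: it assumes `java.util.Random.nextGaussian()` as the Gaussian source, rounds to the nearest integer, clamps to 1..10, and picks each movie's mean and standard deviation at random. The class name, method names, and the ranges used for the mean and spread are hypothetical choices.

```java
import java.util.Random;

final class GaussianRatings {
    // One rating: a Gaussian draw around `mean`, rounded and clamped to 1..10.
    static int rating(Random rng, double mean, double stdDev) {
        long rounded = Math.round(mean + stdDev * rng.nextGaussian());
        return (int) Math.max(1, Math.min(10, rounded));
    }

    // counts[r - 1] holds the number of votes for rating r.
    static int[] histogram(Random rng, int votes, double mean, double stdDev) {
        int[] counts = new int[10];
        for (int i = 0; i < votes; i++) {
            counts[rating(rng, mean, stdDev) - 1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int movie = 1; movie <= 3; movie++) {
            double mean = 3 + 6 * rng.nextDouble();      // assumed range: 3..9
            double stdDev = 1 + 1.5 * rng.nextDouble();  // assumed range: 1..2.5
            int[] counts = histogram(rng, 1000, mean, stdDev);
            System.out.printf("movie %d (mean %.1f, sd %.1f)%n", movie, mean, stdDev);
            for (int r = 1; r <= 10; r++) {
                System.out.println(r + ": " + counts[r - 1]);
            }
        }
    }
}
```

Because the mean and spread are redrawn per movie, each run gives differently placed and differently shaped peaks. Clamping means ratings 1 and 10 absorb the tails, which produces small bumps at the ends.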

Wayne Hartman
  • So I just need something like that but for PHP. The GaussianGenerator is probably what I'm looking for. – rfgamaral Dec 30 '10 at 05:20
  • @Nazgulled: Uncommons Maths is open source software, so you could adapt the code they use for the GaussianGenerator to PHP. – Wayne Hartman Dec 30 '10 at 05:33

Try the Mersenne Twister Algorithm for good quality random numbers.

http://en.wikipedia.org/wiki/Mersenne_twister

I think there are some PHP implementations of this bad boy:

http://www.phpdig.net/ref/rn35re672.html

Nice PHP implementation :D

MRFerocius

My advice is to involve time in the random number generation, and also to use functions like mt_rand to improve the randomness. Try doing some complex float operation, then cast to int, and finally apply % max_value so that the result fits your limit.

Example:

function x()
{
    // mt_rand() takes integer bounds, so draw from 1..10
    return time() * 7.3333333333 * mt_rand(1, 10);
}

// Cast to int before taking the remainder; the result is in 0..9
$rank = (int) (x() + 3.99999) % 10;

I'm not saying this works, but it illustrates the idea. Hope it helps!

guiman

As implied by Kenny, you want to look at a normal distribution. If you look at the ratings on IMDB, you will see that most films follow a roughly normal distribution. The exceptions are the very top and bottom ratings: a lot of people say they love or hate a film, exaggerating their true feelings, hence the spikes at both ends. So for an accurate set of data, you will need to add these in. Perhaps let the lowest rating = (sum of the next two lowest) * a constant?
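
A rough sketch of that last suggestion, applied to a histogram of vote counts. Everything named here is hypothetical: the class, the `addSpikes` helper, and the constant `k`. Applying the same rule to the top rating (10) is an added assumption, since the suggestion above only spells out the lowest one. The bell-shaped input comes from `java.util.Random.nextGaussian()`, rounded and clamped to 1..10.

```java
import java.util.Random;

final class SpikedRatings {
    // Bell-shaped histogram: counts[r - 1] is the number of votes for rating r.
    static int[] bell(Random rng, int votes, double mean, double stdDev) {
        int[] counts = new int[10];
        for (int i = 0; i < votes; i++) {
            long r = Math.round(mean + stdDev * rng.nextGaussian());
            counts[(int) Math.max(1, Math.min(10, r)) - 1]++;
        }
        return counts;
    }

    // Replace the two end buckets with (sum of the two neighbouring buckets) * k.
    static int[] addSpikes(int[] counts, double k) {
        int[] out = counts.clone();
        out[0] = (int) Math.round((counts[1] + counts[2]) * k); // "hate it" spike at 1
        out[9] = (int) Math.round((counts[8] + counts[7]) * k); // "love it" spike at 10
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int[] counts = addSpikes(bell(rng, 1000, 6.5, 1.5), 0.5);
        for (int r = 1; r <= 10; r++) {
            System.out.println(r + ": " + counts[r - 1]);
        }
    }
}
```

Note that this changes the total number of votes, so the spikes would need to be added before the individual rating rows are generated for the database.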

winwaed
  • I don't need really accurate data, I just don't want all movies to have a similar distribution (for testing purposes only), centered in the same mean. I'll look into normal distribution with PHP then. – rfgamaral Dec 30 '10 at 05:15

I too support Kenny's advice, but would like to add a note on implementation. Although this isn't the best approach, I've seen it implemented a few times because of its ease.

Imagine an array ten elements long, each element containing a value of 10. If you generate a random number between 1 and 100, you can walk the array keeping a running sum of the elements, advancing to the next index for as long as the random number is greater than the running sum. The index where you stop is your rating. In this way you map 1-100 to 1-10.

Although the above would be a horrible use of this technique, you can readily see how, with a little creativity, you can create your own non-uniform distributions. For instance, consider:

1,2,4,8,16,16,8,4,2,1

The above 10 elements sum to 62 and so would be well suited to mapping 1-62 to 1-10 (this is just an illustration). The implementations that I've seen like to have the distribution always sum to a particular number, but if you encapsulate getting a random number from 1-10, then you can have distributions that sum differently.
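
The running-sum lookup described above might be sketched as follows. The class and method names are hypothetical. The total is computed from the weights at call time, so the weight vectors do not have to sum to any particular number.

```java
import java.util.Random;

final class WeightedRatings {
    // Map a roll in 1..sum(weights) to a rating in 1..weights.length by
    // walking the array and keeping a running sum.
    static int pick(int[] weights, int roll) {
        int running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            if (roll <= running) {
                return i + 1;
            }
        }
        throw new IllegalArgumentException("roll exceeds total weight");
    }

    // Draw one rating according to the weights.
    static int sample(Random rng, int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        return pick(weights, rng.nextInt(total) + 1); // roll is uniform in 1..total
    }

    public static void main(String[] args) {
        int[] weights = {1, 2, 4, 8, 16, 16, 8, 4, 2, 1};
        Random rng = new Random();
        int[] counts = new int[10];
        for (int i = 0; i < 1000; i++) {
            counts[sample(rng, weights) - 1]++;
        }
        for (int r = 1; r <= 10; r++) {
            System.out.println(r + ": " + counts[r - 1]);
        }
    }
}
```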

By creating only a few such distributions, you can potentially build many sensible ones by summing the probability vectors. Consider a distribution highly localized around 3 and a distribution highly localized around 8: perhaps it's the latest zombie slasher, and the zombie lovers all voted 8 because as zombie movies go it was pretty good, while the rest of the movie-going public voted 3 because, in general, it more or less sucked.
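
As an illustration of summing two such vectors, here is a small self-contained sketch. The two weight vectors (one peaked at 3, one peaked at 8) and all names are made up for the example. Sampling uses the same running-sum lookup described earlier in this answer.

```java
import java.util.Random;

final class MixedRatings {
    // Element-wise sum of two weight vectors of equal length.
    static int[] combine(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // Draw one rating (1-based index) according to the weights.
    static int sample(Random rng, int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        int roll = rng.nextInt(total) + 1; // uniform in 1..total
        int running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            if (roll <= running) {
                return i + 1;
            }
        }
        throw new IllegalStateException("unreachable when total > 0");
    }

    public static void main(String[] args) {
        int[] generalPublic = {1, 4, 16, 4, 1, 0, 0, 0, 0, 0}; // peaked at 3
        int[] zombieFans    = {0, 0, 0, 0, 0, 1, 4, 16, 4, 1}; // peaked at 8
        int[] weights = combine(generalPublic, zombieFans);
        Random rng = new Random();
        int[] counts = new int[10];
        for (int i = 0; i < 1000; i++) {
            counts[sample(rng, weights) - 1]++;
        }
        for (int r = 1; r <= 10; r++) {
            System.out.println(r + ": " + counts[r - 1]);
        }
    }
}
```

Scaling one vector before summing (for example, doubling the fans' weights) shifts how much each group contributes, which is one way to get a different two-humped chart per movie.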

Quaternion