Zipf's Law in Java for text generation - too slow

Question

Hey there I'm working on a textgenerator, that should generate millions of different texts. to make the content of each text realistic I used Zipf's Law It is working well, the word distribution is correct.

But the following next() function is executing very slow and since I want to generate millions of articles it has to be changed. (The while loop is the slow part)

Can someone help me with this?

I implemented it like this:

   public int next() {

    int rank;
    double frequency = 0;
    double dice;

    rank = rnd.nextInt(size);
    frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    dice = rnd.nextDouble();


    while (!(dice < frequency) || (rank == 0)) {
        rank = rnd.nextInt(size);
        frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();
    }

    return rank;
}

EDIT: I obtainded the code from : http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/

Guess: you are running a 1.7 or less JVM on Linux, and your rnd is a SecureRandom? — fge, Nov 24 '14 at 13:19
Post the complete code so people can compile it without filling out bits themselves. The while loop has only 3 lines. Check which function takes how long to execute. Unless there's something fishy with java's RNG, it's probably the math operation that takes long. — XapaJIaMnu, Nov 24 '14 at 13:33
For the full code you can take the code from the website I posted above in my EDIT. The bottleneck is (within) the while loop? I tried using fastmath instead of math but the codeexecution got 1/3 slower.. — Any1, Nov 24 '14 at 13:51

score 5 · Accepted Answer · answered Nov 24 '14 at 15:25

The implementation that you copied ... has some issues. One could probably say that it is plainly wrong, because it is using random values, and when in a computation like

rank = rnd.nextInt(size);
friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;

the rank value is 0, then the frequency is Infinity, and messes up some of the statistics.

I tried to correct these erros, but have not analyzed the implementation and have not compared it to the definition of the Zipf distribution function. So if somebody copies my code, he might find out that it still "...has some issues".

The implementation of the next function is, strictly speaking, not "total correct", in the sense that it does not necessarily terminate. There's nothing preventing the loop from running forever. Depending on the parameters, it may be more or less likely that it will take a while until it terminates. And I think that's also one of the main reasons for your "performance" issue: For some values, the condition (dice < frequency) is just very unlikely to happen....

Regardless of that, the goal that you want to achieve can be formulated more generically: You have a certain distribution of probabilities. And you want a "random" function that returns random values based on this distribution.

One simple and generic way to achieve this is to map the (cumulated) probability distribution to the target values with a NavigableMap. This map can then be used to quickly look up the target value, given a random value between 0.0 and 1.0 that is supplied by a java.util.Random instance.

There may be more efficient solutions for particular cases, but again: This is very generic and simple (and still, reasonably efficient).

I implemented this here for the Zipf distribution. Again, I did not verify everything in detail, and there are some +1/-1 oddities (mentioned in the first paragraph), but it should show the idea: The FastZipfGenerator fills the map containing the probability distribution, and in the next() function, just performs a lookup:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class ZipfGeneratorTest
{
    public static void main(String[] args) {

        int size = 10;
        double skew = 2.0;

        ZipfGenerator z0 = new ZipfGenerator(size, skew);
        FastZipfGenerator z1 = new FastZipfGenerator(size, skew);

        long before = 0;
        long after = 0;

        int n = 5000000;

        before = System.nanoTime();
        Map<Integer, Integer> counts0 = computeCounts(z0, size, n);
        after = System.nanoTime();
        System.out.println(counts0+", duration "+(after-before)/1e6);

        before = System.nanoTime();
        Map<Integer, Integer> counts1 = computeCounts(z1, size, n);
        after = System.nanoTime();
        System.out.println(counts1+", duration "+(after-before)/1e6);
    }

    private static Map<Integer, Integer> computeCounts(
        ZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i=1; i<=size; i++)
        {
            counts.put(i, 0);
        }
        for (int i=1; i<=n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k)+1);
        }
        return counts;
    }

    private static Map<Integer, Integer> computeCounts(
        FastZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i=1; i<=size; i++)
        {
            counts.put(i, 0);
        }
        for (int i=1; i<=n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k)+1);
        }
        return counts;
    }

}

// Based on http://diveintodata.org/tag/zipf/
class ZipfGenerator {
    private Random rnd = new Random(0);
    private int size;
    private double skew;
    private double bottom = 0;

    public ZipfGenerator(int size, double skew) {
        this.size = size;
        this.skew = skew;

        for(int i=1;i <=size; i++) {
            this.bottom += (1/Math.pow(i, this.skew));
        }
    }

    // the next() method returns an random rank id.
    // The frequency of returned rank ids are follows Zipf distribution.
    public int next() {
        int rank;
        double friquency = 0;
        double dice;

        rank = rnd.nextInt(size)+1;
        friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();

        while(!(dice < friquency)) {
            rank = rnd.nextInt(size)+1;
            friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
            dice = rnd.nextDouble();
        }

        return rank;
    }


    // This method returns a probability that the given rank occurs.
    public double getProbability(int rank) {
        return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    }
}



class FastZipfGenerator
{
    private Random random = new Random(0);
    private NavigableMap<Double, Integer> map;

    FastZipfGenerator(int size, double skew)
    {
        map = computeMap(size, skew);
    }

    private static NavigableMap<Double, Integer> computeMap(
        int size, double skew)
    {
        NavigableMap<Double, Integer> map = 
            new TreeMap<Double, Integer>();

        double div = 0;
        for (int i = 1; i <= size; i++)
        {
            div += (1 / Math.pow(i, skew));
        }

        double sum = 0;
        for(int i=1; i<=size; i++)
        {
            double p = (1.0d / Math.pow(i, skew)) / div;
            sum += p;
            map.put(sum,  i-1);
        }
        return map;
    }

    public int next()
    {
        double value = random.nextDouble();
        return map.ceilingEntry(value).getValue()+1;
    }

}

It prints a random sample result (basically, a "histogram"), and some timing results. The timing results are something like

duration 6221.835052
duration 304.761282

showing that it will most likely be faster (even though this should not be considered as a "benchmark"...)

Thanks a lot for your elaborated response and the effort explaining everything! I will try it tomorrow and mark your answer solved then! — Any1, Nov 24 '14 at 17:23
If "the rank value is 0, then the frequency is Infinity", but nobody should care, as the loop terminates before Infinity can get used. — maaartinus, Nov 24 '14 at 18:07
@maaartinus In my particular case, the rank 0 Issue is very Important, because it messes up the word distribution In my text! - I didn't analyze it further, but my most common word appeared way to often in the generated text-- and Marco13 thank you again you are very kind! with your help I get an enormous performance boost and learned something! — Any1, Nov 24 '14 at 21:55
@maaartinus During my tests, the resuting distribution (the histogram) did not match the actual function values of the probability distribution, and preventing the "Infinity" case (just by adding 1) solved this. But I have not verified the methods beyond that. — Marco13, Nov 24 '14 at 23:03
@Marco13 But adding 1 does not only prevent `Inf`, it simply computes something else. It may be the right thing, I do't know. Compare the original version with mine, which also avoids `Inf`; they must work identically. — maaartinus, Nov 25 '14 at 04:26
Would be great, if you commit your code to the apache.commons.math3 project. Their zipf is much slower than yours. Together with the author I worked on the problem of this slow zipf and your solution gave us a boost from 6000ms to 23ms for a zipf distributed text with a length of 3000 words out of a sample of 10000 words. (commons.math3 was 8000ms) — Malte, Dec 01 '14 at 15:09
@Reitffunk I'm not familiar with the process and protocol of contributing to Apache libs. I would consider it, if others found this code useful. But in this case, I think it's not an issue of the Zipf implementation itself, but more of the way how the function is **used**. As mentioned in the answer: It's a more general task that a cumulative value of a prob-function should serve as a basis for a random number generator. So to say, the line `double p = ...` could also be `double p = ApacheZipf.compute(...)`, and this would (should) give the same result with (most likely) similar speed... — Marco13, Dec 01 '14 at 16:19

score 3 · Answer 2 · answered Dec 21 '17 at 02:14

The source you obtained from https://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/ has some bugs.

Here are quick fixes. (1) In the constructor ZipfGeneator(int,double), make sure to compute up to size using equal sign.

public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  for(int i=1;i <= size; i++) {
  this.bottom += (1/Math.pow(i, this.skew));
  }
 }

(2) replace

rank = rnd.nextInt(size);

with

rank = rnd.nextInt(size)+1;

Here is the complete source code.

import java.util.Random;

//credit: https://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/ [Online; December 2017]

public class ZipfGenerator {
 private Random rnd = new Random(System.currentTimeMillis());
 private int size;
 private double skew;
 private double bottom = 0;

 public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  for(int i=1;i <= size; i++) {
  this.bottom += (1/Math.pow(i, this.skew));
  }
 }

 // the next() method returns an random rank id.
 // The frequency of returned rank ids are follows Zipf distribution.
 public int next() {
   int rank;
   double friquency = 0;
   double dice;

   rank = rnd.nextInt(size)+1;
   friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
   dice = rnd.nextDouble();

   while(!(dice < friquency)) {
     rank = rnd.nextInt(size)+1;
     friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
     dice = rnd.nextDouble();
   }

   return rank;
 }

 // This method returns a probability that the given rank occurs.
 public double getProbability(int rank) {
   return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
 }

 public static void main(String[] args) {
   if(args.length != 2) {
     System.out.println("usage: ./zipf size skew");
     System.exit(-1);
   }

   ZipfGenerator zipf = new ZipfGenerator(Integer.valueOf(args[0]),
   Double.valueOf(args[1]));
   for(int i= 1;i <= 10; i++) {
     System.out.println(i+" "+zipf.getProbability(i));
   }
   //use size = 10 and skew = 2 for testing below
   int hist [] = new int [12];
   for(int i=0;i<12;i++) {
       hist[i] = 0;
   }
   System.out.println("Testing the probability distribution:");
   int sum = 0;
    for(int i= 1;i <= 1000000; i++) {
        hist[zipf.next()]++; 
   }
   for(int i=0;i<12;i++)
     System.out.println(i+" "+hist[i]/1000000.0);
    }

}

Result:

Probability distribution from the formula:
1 0.6452579827864142
2 0.16131449569660355
3 0.07169533142071269
4 0.04032862392415089
5 0.02581031931145657
6 0.017923832855178172
7 0.013168530260947229
8 0.010082155981037722
9 0.007966147935634743
10 0.006452579827864143
Testing the probability distribution from sampling:
0 0.0
1 0.645813
2 0.160766
3 0.071527
4 0.040346
5 0.026039
6 0.01801
7 0.013215
8 0.009953
9 0.007863
10 0.006468
11 0.0

Note, 0 and 11 has 0 probability as expected.

score 1 · Answer 3 · answered Nov 24 '14 at 18:23

You were asking about speed, so I'm presenting a minor optimization. First of all, get rid of the repetitive stuff to see what it's all about:

public int next() {
    while (true) {
        int rank = rnd.nextInt(size);
        if (rank == 0) return return rank;
        double frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        double dice = rnd.nextDouble();
        if (dice < frequency) return rank;
    }
}

So far it should work exactly the same (unless I overlooked something). I moved the test of rank upwards as the following computation is useless in case it's zero. Now there's a single line which we can speed up a bit like

double frequency = Math.pow(rank, -this.skew) * inverseBottom;

Actually, this may slightly change the result due to round-off errors, but I doubt you should care. If rank was constant, you could turn pow into exp to make it faster, but it's not. For a small size, you could precompute a table of ln(rank) and use it like

double frequency = Math.exp(ln[rank] * -this.skew) * inverseBottom;

A better algorithm could surely give you more than this low-level optimizations.

Zipf's Law in Java for text generation - too slow

3 Answers3