
I am simulating an infinite sequence of die rolls to calculate the average "hitting time" for a sequence. In this particular case I am looking for the first occurrence of a "11" or a "12". For example, in "34241213113..." the first occurrence of "12" is at time 6 and that of "11" is at time 10. Here is my Python code.

import numpy as np

NN = 1000000
t11 = np.zeros(NN)
t12 = np.zeros(NN)

for i in range(NN):
    prev = np.random.randint(1, 7)   # first roll
    flag11 = True                    # still looking for "11"
    flag12 = True                    # still looking for "12"
    ctr = 2                          # time of the roll about to be made
    while flag11 or flag12:
        curr = np.random.randint(1, 7)
        if flag11 and prev == 1 and curr == 1:
            t11[i] = ctr
            flag11 = False
        if flag12 and prev == 1 and curr == 2:
            t12[i] = ctr
            flag12 = False
        ctr = ctr + 1
        prev = curr
print('Mean t11:  %f' % np.mean(t11))
print('\nMean t12: %f' % np.mean(t12))

As soon as both sequences have been observed, we start a new sample path. It takes about a million sample paths before the expected values converge to the theoretical ones (42 for "11" and 36 for "12"), and the code takes about a minute to run. I am new to Python, and have been using it for just about a month.
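(For reference, those theoretical values follow from the standard overlap argument: a pattern's expected hitting time is the sum of 6^k over every length k at which the pattern's k-character prefix equals its k-character suffix. "11" overlaps itself at k = 1 and k = 2, giving 6 + 36 = 42, while "12" overlaps only at k = 2, giving 36.)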

I was wondering if there is a way to speed up the code: maybe a different algorithm, or optimizing the routines? Would it have notably different performance in a compiled language vs. an interpreted one?

ITA

4 Answers


You can speed this up a lot by doing more work per call to NumPy (generating blocks of random numbers instead of single values) and by simplifying the pattern search using Python's built-in bytes scanning:

import numpy as np

NN=1000000
t11=np.zeros(NN)
t12=np.zeros(NN)

for i in range(NN):
    block = b'\xff'  # Prepopulate with a garbage byte that can never match, so indexing stays aligned
    flag11 = flag12 = True
    ctr = 1  # One lower than in the original to account for the non-generated first byte

    while flag11 or flag12:
        # Generate 100 numbers at once, much faster than one at a time,
        # store as bytes for reduced memory and cheap searches
        # Keep last byte of previous block so a 1 at end matches 1/2 at beginning of next
        block = block[-1:] + bytes(np.random.randint(1, 7, 100, np.uint8))

        # Containment test scans faster in C than Python level one-at-a-time check
        if flag11 and b'\x01\x01' in block:
            t11[i] = ctr + block.index(b'\x01\x01')
            flag11 = False
        if flag12 and b'\x01\x02' in block:
            t12[i] = ctr + block.index(b'\x01\x02')
            flag12 = False
        ctr += 100
print('Mean t11:  %f' %(np.mean(t11)))
print('\nMean t12: %f' %(np.mean(t12)))

On my (admittedly underpowered) machine your original code took ~96 seconds to run; my optimized version took ~6.6 seconds, about 7% of the original runtime. Even though (on average) over half of the random numbers generated are never needed, it is still faster to over-generate than to pay the Python-level cost of looping and drawing one roll at a time.

Rewriting a bit further, you can avoid the double-scan of block by changing:

        if flag11 and b'\x01\x01' in block:
            t11[i] = ctr + block.index(b'\x01\x01')
            flag11 = False

to the somewhat more verbose, but more efficient:

        if flag11:
            try:
                t11[i] = ctr + block.index(b'\x01\x01')
            except ValueError:
                pass
            else:
                flag11 = False

(and make an equivalent change to flag12 testing)

Since the first 100 bytes generated typically contain a hit anyway, this replaces two scans with one and reduces the overall runtime to ~6.0 seconds. There are more extreme micro-optimizations available (more about knowing CPython internals than any logical improvement) that get it down to ~5.4 seconds on my machine, but they're ugly and not worth the bother 99.9% of the time.
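The answer doesn't spell out which micro-optimizations those are. Purely as an illustration of the genre (my guess, not necessarily what was measured above), here is a variant that hoists the dotted attribute lookups into locals and uses bytes.find, which returns -1 on a miss, so the try/except disappears entirely:

import numpy as np

NN = 1000000
t11 = np.zeros(NN)
t12 = np.zeros(NN)

# Hoist the dotted lookups out of the loop; np.random.randint otherwise
# costs two attribute lookups on every iteration.
randint = np.random.randint
uint8 = np.uint8

for i in range(NN):
    block = b'\xff'
    flag11 = flag12 = True
    ctr = 1
    while flag11 or flag12:
        block = block[-1:] + bytes(randint(1, 7, 100, uint8))
        if flag11:
            idx = block.find(b'\x01\x01')  # find() returns -1 on a miss,
            if idx >= 0:                   # so no exception handling needed
                t11[i] = ctr + idx
                flag11 = False
        if flag12:
            idx = block.find(b'\x01\x02')
            if idx >= 0:
                t12[i] = ctr + idx
                flag12 = False
        ctr += 100
print('Mean t11:  %f' % np.mean(t11))
print('\nMean t12: %f' % np.mean(t12))

The hoisting only matters because the loop body is otherwise so cheap; whether it accounts for the full 6.0 s to 5.4 s gap is a guess.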

ShadowRanger
  • Note: Code tested on Python 3; to make it work on Python 2, change `bytes(np.random.randint(1, 7, 100, np.uint8))` to `bytes(np.random.randint(1, 7, 100, np.uint8).data)` so Py2 (where `bytes` is just an alias for `str`) properly converts using the buffer interface, rather than erroneously making a human readable string (which will never match). The code actually runs much faster on Py2 (haven't investigated reasons), taking ~4.7 seconds. – ShadowRanger Feb 03 '17 at 06:13
  • Hey, thanks for this. I like that you stuck to Python and optimized or changed the algorithm instead of changing language or using a different tool. This was most along the lines of what I was looking for. – ITA Feb 03 '17 at 14:39

There is a good tool for this problem: a finite state machine.

But Python isn't a very good language for a fast implementation.

Here is a non-deterministic state machine that recognizes the two sequences in any stream of inputs. The * denotes throws of 3 through 6:

[Diagram: the non-deterministic state machine (NFSM)]

This is hard to implement because it can occupy more than one state at a time, but there's a standard algorithm called the subset construction that converts this to a deterministic state machine, which is very efficient to implement. Applying that here produces this:

[Diagram: the deterministic state machine (DFSM) produced by the subset construction]

Here is a C implementation. In Python you would use a map taking a tuple of the current state plus the input number to the next state; here, goto implements that map in terms of position in the executing code:

#include <stdio.h>
#include <stdlib.h>

#define ROLL do { r = 1 + rand() % 6; } while (0)

int main(void) {
  int n_trials = 10000000;
  int t_total_11 = 0;
  int t_total_12 = 0;

  for (int n = 0; n < n_trials; n++) {
    int r, t = -1, t_11 = 0, t_12 = 0;
    A:    /* no progress toward either pattern */
      ++t;
      ROLL;
      if (r == 1) goto AB;
      goto A;
    AB:   /* last roll was a 1 */
      ++t;
      ROLL;
      if (r == 1) goto ABC;
      if (r == 2) goto AD;
      goto A;
    ABC:  /* a "11" was just completed */
      ++t;
      if (!t_11) {
        t_11 = t;
        t_total_11 += t_11;
        if (t_12) continue;
      }
      ROLL;
      if (r == 1) goto ABC;
      if (r == 2) goto AD;
      goto A;
    AD:   /* a "12" was just completed */
      ++t;
      if (!t_12) {
        t_12 = t;
        t_total_12 += t_12;
        if (t_11) continue;
      }
      ROLL;
      if (r == 1) goto AB;
      goto A;
  }
  printf("Avg for 11: %lf\n", (double) t_total_11 / n_trials);
  printf("Avg for 12: %lf\n", (double) t_total_12 / n_trials);
  return 0;
}

On my old MacBook, this does 10 million (not 1 million) iterations in 5.3 seconds, so it's a cool ~100 times faster. Of course a good bit depends on the speed of the PRNG; GNU's rand is fast but not very random, though apparently good enough to illustrate convergence. The program prints:

Avg for 11: 41.986926
Avg for 12: 35.997196

When I have more time, I'll try a Python implementation.
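In the meantime, here is a rough sketch of what that Python version might look like (my sketch, not Gene's code): the DFSM becomes a dict mapping (state, roll) pairs to the next state, mirroring the goto labels above.

import numpy as np

# States: 'A' = no progress, 'AB' = last roll was a 1,
# 'ABC' = "11" just completed, 'AD' = "12" just completed.
# Missing keys mean "fall back to state A".
NEXT = {('A', 1): 'AB',
        ('AB', 1): 'ABC', ('AB', 2): 'AD',
        ('ABC', 1): 'ABC', ('ABC', 2): 'AD',
        ('AD', 1): 'AB'}

def run_trials(n_trials):
    total11 = total12 = 0
    for _ in range(n_trials):
        state, t, t11, t12 = 'A', 0, 0, 0
        while not (t11 and t12):
            t += 1
            roll = np.random.randint(1, 7)
            state = NEXT.get((state, roll), 'A')
            if state == 'ABC' and not t11:
                t11 = t
            elif state == 'AD' and not t12:
                t12 = t
        total11 += t11
        total12 += t12
    print('Avg for 11: %f' % (total11 / n_trials))
    print('Avg for 12: %f' % (total12 / n_trials))

run_trials(1000000)

Note this keeps the one-roll-at-a-time Python loop, so it demonstrates the state machine rather than the speed; the ~100x win above comes from the compiled dispatch, not the FSM alone.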

Gene
  • I am not familiar with finite state machines, but it seems like a good concept to look up. Even with the diagram, I understand ABC and AD are the "11" and "12" we are looking for. Just to clarify: does each edge represent a die roll, and are the A, B nodes steps on the way to ABC and AD? Do you suggest any reading material on finite state machines? – ITA Feb 03 '17 at 14:37
  • @IvanAbraham Yes. FSMs recognize formal languages over a given alphabet. Here the "language" is strings of digits 1-6. Each edge consumes one "character," which is the result of one die roll. The double circles are accepting states: when the machine is in an accepting state, the input string has been recognized as "in the language"; otherwise it's not. In this problem, the accepting states connote that a "11" or a "12", respectively, has been encountered. The complexity of the machine is due to strings like 112, which contains both a 11 and a 12; it successfully recognizes both. – Gene Feb 04 '17 at 01:03

Programs in compiled languages generally perform significantly better than programs in interpreted languages. This is why high-frequency trading systems, video game engines and other highly demanding software are written in compiled languages such as C++.

In terms of optimizations, you could try using Python's compilation features, or running the program natively instead of inside an IDE.
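The answer doesn't name a specific tool; as one concrete example of such "compilation features" (my choice, not the answer's), a JIT compiler such as Numba can compile the original loop essentially unchanged:

import numpy as np
from numba import njit

@njit  # compiles the loop below to machine code on first call
def simulate(nn):
    t11 = np.zeros(nn)
    t12 = np.zeros(nn)
    for i in range(nn):
        prev = np.random.randint(1, 7)
        flag11 = True
        flag12 = True
        ctr = 2
        while flag11 or flag12:
            curr = np.random.randint(1, 7)
            if flag11 and prev == 1 and curr == 1:
                t11[i] = ctr
                flag11 = False
            if flag12 and prev == 1 and curr == 2:
                t12[i] = ctr
                flag12 = False
            ctr += 1
            prev = curr
    return t11, t12

t11, t12 = simulate(1000000)
print('Mean t11: %f' % np.mean(t11))
print('Mean t12: %f' % np.mean(t12))

After the one-time compilation cost on the first call, loops like this typically run at a small multiple of the equivalent C code's speed.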

Joe Cappo

Here is a Cython implementation of your code snippet, which simulates 10^6 sample paths in 0.7 seconds on my machine:

from libc.stdlib cimport rand
import numpy as np
cimport numpy as np

DTYPE = np.int64
ctypedef np.int64_t DTYPE_T


cdef int simple_minded_randint(int min_val, int max_val):
    """For demonstration purpose only! Does not generate a uniform distribution."""
    return min_val + rand() % max_val


def diceroll(numrolls):
    cdef long NN = numrolls
    cdef long i
    cdef DTYPE_T ctr, prev, curr
    cdef int flag11, flag12
    cdef np.ndarray[DTYPE_T, ndim=1] t11 = np.zeros(NN, dtype=DTYPE)
    cdef np.ndarray[DTYPE_T, ndim=1] t12 = np.zeros(NN, dtype=DTYPE)

    for i in range(NN):
        prev = simple_minded_randint(1, 6)
        flag11 = 1
        flag12 = 1
        ctr = 2
        while flag11 or flag12:
            curr = simple_minded_randint(1, 6)
            if flag11 and prev == 1 and curr == 1:
                t11[i] = ctr
                flag11 = 0
            if flag12 and prev == 1 and curr == 2:
                t12[i] = ctr
                flag12 = 0
            ctr = ctr + 1
            prev = curr
    print('Mean t11:  %f' %(np.mean(t11)))
    print('Mean t12: %f' %(np.mean(t12)))

I added some static types and used the random generator from the C standard library, because calling np.random.randint() in a loop slows things down quite a bit. Note that this random generator is for demonstration purposes only, as it does not generate a uniform distribution; see this answer.
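For completeness, the usual fix for that modulo bias is rejection sampling. Here is a minimal pure-Python sketch of the idea (illustrative only, not part of the Cython code above; RAND_MAX and the raw generator are stand-ins):

import random

RAND_MAX = 2**31 - 1  # a typical C RAND_MAX; the real value is platform-dependent

def unbiased_randint(raw_rand, min_val, max_val):
    """Uniform draw from [min_val, max_val], given raw_rand() returning 0..RAND_MAX."""
    span = max_val - min_val + 1
    # Largest multiple of span within the generator's range; raw values at or
    # above this limit would over-represent the low residues, so reject them.
    limit = (RAND_MAX + 1) - (RAND_MAX + 1) % span
    while True:
        r = raw_rand()
        if r < limit:
            return min_val + r % span

# Example usage with a stand-in for C's rand():
roll = unbiased_randint(lambda: random.getrandbits(31), 1, 6)

The same loop translates directly into the cdef helper above by wrapping libc's rand().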

sfinkens