6

What is an efficient way to generate a random contingency table? A contingency table is defined as a rectangular matrix such that the sum of each row is fixed, and the sum of each column is fixed, but the individual elements may be anything as long as the sum of each row and column is correct.

Note that it's very easy to generate random contingency tables, but I'm looking for something more efficient than the naive algorithm.

dsimcha
  • 67,514
  • 53
  • 213
  • 334

4 Answers4

6

Looking at the code of the networksis package for R might be helpful. I believe that efficient computation requires fancy Markov Chain sequential importance resampling techniques, so you might want to avoid reimplementing this if you can avoid it.

Edit: The relevant paper is Chen, Diaconis, Holmes, and Liu (2005). In the words of the authors, "[o]ur method compares favorably with other existing Monte Carlo- based algorithms, and sometimes is a few orders of magnitude more efficient."

chl
  • 27,771
  • 5
  • 51
  • 71
othercriteria
  • 431
  • 2
  • 4
  • Thanks. This is exactly the answer I didn't want to hear--something amazingly complicated. I needed to know this because I'm working on a small open-source statistics lib for the D programming language. Knowing this, I guess I just won't include this feature. – dsimcha Jun 05 '09 at 00:40
  • Would you be able summarise the paper's algorithm(s)? Or include an example of this solution? – Jon Aug 02 '17 at 20:20
1

A solution to this problem, and implemented in R, is Algorithm AS159. Here is the paper

Patefield, W. M. (1981) Algorithm AS159. An efficient method of generating r x c tables with given row and column totals. Applied Statistics 30, 91–97.

You can follow this link for implementations of the algorithm. If you're used to working with R, you can simply use its r2dtable function for this.

This algorithm can be used to generate Monte Carlo p-values for Chi-squared tests as produced in R's chisq.test function. Note that chisq.test does not call r2dtable, but a direct C version of the AS159 algorithm.

Jon
  • 2,373
  • 1
  • 26
  • 34
1

This sounds like a constraint satisfaction problem (CSP) to me.

You would basically start at some point and choose a cell's value randomly from the set of allowed values. Then you update the sets of eligible values for all cells in the same row/column and choose the next cell (according to the CSP heuristic you are using) to (randomly) assign a value to, again from its set of eligible values. Again, you also have to update the sets of eligible values for all cells in the same row/column. In case you encounter a cell that has an empty set of eligible values, you have to do backtracking.

However, the notion of 'set of eligible values' might be hard to represent in a data structure, depending on the range of values you are allowing.

Roland Ewald
  • 4,630
  • 3
  • 35
  • 49
0

I'm not sure what your naïve algorithm is. Here's the first thing I think of:

m\cdot n variables with m+n linear constraints leads to the expectation that the solution space has (m-1)\cdot(n-1)-1 degrees of freedom.

Suppose we just pick that many random numbers to start with.

  a11     a12   ... a1[n-1] b
  a21     a22   ... a2[n-1] x2-x1+b
  ...     ...   ...   ...   ...
a[m-1]1 a[m-1]2 ...    d    f
   c    y2-y1+c ...    g    e

Define the constants x_i=\sum_{j=1}^{n-1}a_{i,j} and y_j=\sum_{i=1}^{m-1}a_{i,j}.

This leads to the following constraints on the variables b, c, d, e, f, g:

x_1+b=\sum_{j=1}^{n-2}a_{m-1,j}+d+f=(m-2)\cdot(c-y_1)+\sum_{\i=1}^{m-2}y_i+g+e
y_1+c=\sum_{i=1}^{m-2}a_{j,n-1}+d+g=(n-2)\cdot(c-x_1)+\sum_{j=1}^{n-2}x_j+f+e

This is a linear system of 6 variables. It probably has a unique solution… I'll work on that tomorrow. (Lack of Maple or other computer algebra systems at the moment.)

Community
  • 1
  • 1
ephemient
  • 198,619
  • 38
  • 280
  • 391