3

I am trying to create a matrix with random numbers where the rowSums should exactly be 1.

I already have a condition which checks if the rowSums is not 1 and tries to correct it.

When I print out the result it looks correct but if I test if all values are 1 it gives me some FALSE values.

How can I correct that?

library(Rcpp)

cppFunction('
NumericMatrix imembrandc(int n, int k) {
  NumericMatrix u( n , k );
  IntegerVector sequ = seq(1,100);
  NumericVector sampled;
  for (int i=0; i < k; ++i) {
    sampled = sample(sequ, n);
    u(_,i) = sampled / sum(sampled);
  }

  if (is_true(any(rowSums(u) != 1))) {
    u(_,1) = u(_,1) + (1 - rowSums(u));
  }

  return(u);
}')

When I print out the rowSums of the result it looks correct:

res = imembrandc(n = 10, k = 5)
rowSums(res)

[1] 1 1 1 1 1 1 1 1 1 1

But checking it gives some FALSEs:

rowSums(res) == 1

[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE

SeGa
  • 9,454
  • 3
  • 31
  • 70
  • If you use `View(res)` you will se that they are not really 1 – pcampana May 24 '19 at 07:45
  • `res` is a matrix of values and the values shouldnt be 1. But the sum of each row should be. So if you do `View(rowSums(res))` you will also just see 1s, but actually they're not. My question is more, how can I make sure that the `rowSums` is indeed exactly 1. – SeGa May 24 '19 at 07:59

1 Answers1

4

The canonical way to generate n random numbers that sum to 1 is to generate n - 1 values from [0,1), add 0 and 1 to the list and take the difference of the sorted list. Of course, this depends on the distribution you want for the random numbers. This can be expressed in R as

set.seed(42)
v <- diff(sort(c(0, runif(5), 1)))
v
#> [1] 0.28613953 0.35560598 0.18870211 0.08435842 0.02226937 0.06292459
sum(v)
#> [1] 1

Created on 2019-05-24 by the reprex package (v0.2.1)

In your case in C++:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix imembrandc(int n, int k) {
  NumericMatrix u(n, k);
  for (int i = 0; i < n; ++i) {
    NumericVector row = runif(k - 1);
    row.push_back(0.0);
    row.push_back(1.0);
    u(i, _) = diff(row.sort());
  }
  return u;
}

/*** R
set.seed(42)
res = imembrandc(n = 10, k = 5)
rowSums(res)
rowSums(res) == 1
all.equal(rowSums(res),rep(1, nrow(res)))
*/

Note that I am generating rows to begin with, while you were generating columns and then tried to correct the rowSum. Output:

> set.seed(42)

> res = imembrandc(n = 10, k = 5)

> rowSums(res)
 [1] 1 1 1 1 1 1 1 1 1 1

> rowSums(res) == 1
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

> all.equal(rowSums(res),rep(1, nrow(res)))
[1] TRUE

BTW, all.equal gives TRUE also for your matrix, since the difference is really small. But I find it better to avoid the problem from the beginning.

Ralf Stubner
  • 26,263
  • 3
  • 40
  • 75
  • Where is there a proof that the resulting numbers sum exactly to one in floating-point arithmetic? – Eric Postpischil May 24 '19 at 12:02
  • At least for me `all(rowSums(res) == 1)` giving `TRUE` is proof enough :) – SeGa May 24 '19 at 12:45
  • @EricPostpischil I don't have a proof, but in over a million tries I have not found a single case where it did not work as expected. – Ralf Stubner May 24 '19 at 17:27
  • 1
    The distribution resulting from this method will skew toward low numbers, resulting in few samples where there are additions of disparate magnitudes that would be fertile ground for rounding errors. Therefore, experience may give a misleading impression. The question requests a method where the same is exactly 1. – Eric Postpischil May 24 '19 at 17:31
  • @EricPostpischil Agreed, empirical methods can be misleading. However, I also failed to create a counter example, i.e an increasing list of floating point numbers from 0.0 to 1.0 where the differences do not sum to 1.0. Do you have such an example? – Ralf Stubner May 24 '19 at 19:20
  • @RalfStubner: Why would it be increasing? I see the code creates some random numbers and sorts them. Then it finds the differences between adjacent numbers. The initial random numbers are increasing due to the sort, but the differences are not sorted. – Eric Postpischil May 24 '19 at 19:32
  • @EricPostpischil The increasing list corresponds to the sorted list of random numbers. The differences of these numbers can be in any order. – Ralf Stubner May 24 '19 at 19:39
  • 3
    [This question](https://stackoverflow.com/questions/55412915/does-double-z-x-y-guarantee-that-zy-x-for-ieee-754-floating-point) provides a disproof. For illustration, consider decimal floating-point with two significant digits. The sequence of increasing numbers might be 0, .005, .11, 1. The real number differences are then .005, .105, and.89. The three-digit .105 must be rounded to two digits, and the round-to-nearest-ties-to-even rule gives .10. So the generated numbers are .005, .10, and .89. Adding the first two produces .10 (real number sum .105, rounds to .10), and the third makes .99. – Eric Postpischil May 24 '19 at 20:19
  • 2
    As noted in my comment on my answer there, the circumstances would be quite rare. – Eric Postpischil May 24 '19 at 20:20
  • @EricPostpischil Thanks for the reference! I think we are in the lucky situation that R's RNG produces at most 32bits of randomness so that the resulting `double`s are to fare apart to trigger this. – Ralf Stubner May 25 '19 at 08:05