6

Imagine 10 cars randomly, uniformly distributed on a round track of length 1. If the positions are represented by a C double in the range [0,1> then they can be sorted and the gaps between the cars should be the position of the car in front minus the position of the car behind. The last gap needs 1 added to account for the discontinuity.

In the program output, the last column has very different statistics and distribution from the others. The rows correctly add to 1. What's going on?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int compare (const void * a, const void * b)
{
  if (*(double*)a > *(double*)b) return 1;
  else if (*(double*)a < *(double*)b) return -1;
  else return 0;
}

double grand_f_0_1(){
  static FILE * fp = NULL;
  uint64_t bits;

  if(fp == NULL) fp = fopen("/dev/urandom", "r");
  fread(&bits, sizeof(bits), 1, fp);
  return (double)bits * 5.421010862427522170037264004349e-020; // https://stackoverflow.com/a/26867455
}

int main()
{
  const int n = 10;
  double values[n];
  double diffs[n];
  int i, j;
  
  for(j=0; j<10000; j++) {
  
    for(i=0; i<n; i++) values[i] = grand_f_0_1();

    qsort(values, n, sizeof(double), compare);
    
    for(i=0; i<(n-1); i++) diffs[i] = values[i+1] - values[i];
    diffs[n-1] = 1. + values[0] - values[n-1];

    for(i=0; i<n; i++) printf("%.5f%s", diffs[i], i<(n-1)?"\t":"\n");

  }
  
  return(0);
}

Here is a sample of the output. The first column represents the gap between the first and second car. The last column represents the gap between 10th car and the first car, across the start/finish line. Large numbers like .33 and .51 are much more common in the last column and very small numbers are relatively rare.

0.13906 0.14241 0.24139 0.29450 0.01387 0.07906 0.02905 0.03160 0.00945 0.01962
0.01826 0.36875 0.04377 0.05016 0.05939 0.02388 0.10363 0.04640 0.03538 0.25037
0.04496 0.05036 0.00536 0.03645 0.13741 0.00538 0.24632 0.04452 0.07750 0.35176
0.00271 0.15540 0.03399 0.05654 0.00815 0.01700 0.24275 0.25494 0.00206 0.22647
0.34420 0.03226 0.01573 0.08597 0.05616 0.00450 0.05940 0.09492 0.05545 0.25141
0.18968 0.34749 0.07375 0.01481 0.01027 0.00669 0.04306 0.00279 0.08349 0.22796
0.16135 0.02824 0.07965 0.11255 0.05570 0.05550 0.05575 0.05586 0.07156 0.32385
0.12799 0.18870 0.04153 0.16590 0.02079 0.06612 0.08455 0.14696 0.13088 0.02659
0.00810 0.06335 0.13014 0.06803 0.01878 0.10119 0.00199 0.06656 0.20922 0.33263
0.00715 0.03261 0.05779 0.47221 0.13998 0.11044 0.06397 0.00238 0.04157 0.07190
0.33703 0.02945 0.06164 0.01555 0.03444 0.14547 0.02342 0.03804 0.16088 0.15407
0.10912 0.14419 0.04340 0.09204 0.23033 0.09240 0.14530 0.00960 0.03412 0.09950
0.20165 0.09222 0.04268 0.17820 0.19159 0.02074 0.05634 0.00237 0.09559 0.11863
0.09296 0.01148 0.20442 0.07070 0.05221 0.04591 0.08455 0.25799 0.01417 0.16561
0.08846 0.07075 0.03732 0.11721 0.03095 0.24329 0.06630 0.06655 0.08060 0.19857
0.06225 0.10971 0.10978 0.01369 0.13479 0.17539 0.17540 0.02690 0.00464 0.18744
0.09431 0.10851 0.05079 0.07846 0.00162 0.00463 0.06533 0.18752 0.30896 0.09986
0.23214 0.11937 0.10215 0.04040 0.02876 0.00979 0.02443 0.21859 0.15627 0.06811
0.04522 0.07920 0.02432 0.01949 0.03837 0.10967 0.11123 0.01490 0.03846 0.51915
0.13486 0.02961 0.00818 0.11947 0.17204 0.08967 0.09767 0.03349 0.08077 0.23426
APA
  • 61
  • 4
  • 2
    You cannot "uniformly distribute" 10 items in the range 0 to 1 using `double`. Please see [Is floating point math broken?](http://stackoverflow.com/questions/588004/is-floating-point-math-broken) and [Why Are Floating Point Numbers Inaccurate?](https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate) – Weather Vane Aug 27 '21 at 17:40
  • 1
    The last column is the result of adding two random numbers, so it *will* have a different distribution. Specifically, it will have the same triangular shaped distribution as rolling two dice. – user3386109 Aug 27 '21 at 17:45
  • 1
    Instead of 10 cars, try 2. Average position of car 1 will be `0.33`; of car 2 `0.66`. One difference is double the other. – pmg Aug 27 '21 at 19:18
  • Please print estimated mean value and variance for each column. – tstanisl Aug 27 '21 at 20:30
  • @pmg, indeed. It would mean that the mean difference between 1st and 2nd is 0.3333 = 0.6666 - 0.3333, but between the 2nd and 1st is 0.6666 = 1 + 0.3333 - 0.6666. Interesting... – tstanisl Aug 27 '21 at 20:53
  • 7
    @WeatherVane "*You cannot "uniformly distribute" 10 items in the range 0 to 1 using `double`*" -- Sure you can, to more precision than you're likely to need. Of course floating-point can't exactly represent all real values, and the precision of `double` varies widely over the range 0.0 to 1.0, but at worst it's typically going to be accurate to about 15 decimal digits. The conversion from `uint64_t` to `double` is going to lose some digits, but not enough to matter for this program. – Keith Thompson Aug 27 '21 at 21:15

3 Answers3

4

Your code is ok. The mean value of the last difference it two times larger than the others.

The paradox comes from the fact is that rather than selecting 10 points on a unit interval one actually tries to divide it into 11 sub-intervals with 10 cuts.Therefore the expected length of each sub-interval is 1/11.

The difference between consecutive points is approaching 1/11 except the last pair because it contains the last sub-interval (between the last point and point 1) and the first one (between point 0 and the first point).

Thus the mean of the last difference is 2/11.

tstanisl
  • 13,520
  • 2
  • 25
  • 40
  • I think this is close. How could I generate the [0,1> positions of 10 cars so that all locations are equally likely to be occupied, and all gaps are statistically indistinct? – APA Aug 27 '21 at 23:40
  • 2
    I guess fixing first point at 0, selecting remainig 9 randomly, sorting and finally shifting **all** points by a random number will get the job done. – tstanisl Aug 28 '21 at 07:54
  • I tried that (plus an extra sort of positions after the shift) and the last gap still has twice the average of the others. See https://gist.github.com/alwynallan/2f6d9fd125a7653dbd8d7217cafe275d – APA Aug 28 '21 at 09:03
  • @APA: shift only after sort, sort only once. When calculating `diffs` (from `0` to `n`) add `1.0` if it's negative. – pmg Aug 28 '21 at 09:41
  • Shift by no more than `1 - values[9]` (after sort, of course) if you want to keep the values in the same order and between `0` and `1` :-) – pmg Aug 28 '21 at 09:49
  • OK, I corrected the code per pmg's first comment and verified that the gaps appear to come from the same distribution. https://gist.github.com/alwynallan/bc1620c9bd87b2a186d7e6dd7e05965f I find this solution very unsatisfying, because the practical effect is just to randomly smear the problematic last column over all the columns. I didn't get a lightbulb moment when I realized my original mistake, and feel confident I won't make a similar mistake in the future. – APA Aug 29 '21 at 12:45
  • I implemented the suggestion in pmg's second comment, which keeps the car positions in sorted order after the shift. https://gist.github.com/alwynallan/7543d86d111c5fa885020022ab842fe2 This also appears to produce gaps that come from the same distribution and have the same statistics. – APA Aug 31 '21 at 08:04
2

"There is no special points on the circle"

The thing is that, on a circle, one car looks always the same, and so there is no need to relate to zero: you can just relate to the first car. This means that you can fix the first car at zero, and treat the random positions of the other cars as related to it (measured from it).

And so, the convenient solution is to fix the first car at zero and think of the 9 numbers you still generate as positions related to the first one.

Hope it's a satisfying answer :-)

IDENTITY (or Which diff is first?)

If 10 cars with labels ("1","2" and so on) are placed randomly on a circle, the difference from the "1" to the next will average 1/10. While sorting, the first diff "loses it's identity", what it is changes: it's similar to that if you chose the 1st diff to be the longest one, it would average more. Choosing it based on relation of cars to zero skews (or, in nicer terms: changes) things in a similar manner.

The first difference (2nd, 3rd etc.) just becomes something different – defining it as a difference from a given car is more intuitive, and gives an option to use it as a reference (playing nicely with circle symmetry); the distribution of the rest of cars with respect to it is a uniform one. Dealing with the smallest of random points is not that simple.

Summary: define what you're calculating, know your definitions and probability is non-intuitive

wojand
  • 158
  • 9
  • I've given this an upvote as well as the other answer. It comes closer to an intuitive reason why the original code didn't model the gaps correctly. – APA Aug 31 '21 at 08:16
  • Thanks! Probability is not intuitive in a truly unique way, more often than not. – wojand Aug 31 '21 at 08:24
0

After 3 months of puzzling over this, I have an explanation that is intuitive, at least to me. This is cumulative to the answers provided by @wojand and @tstanisl.

My original code is correct: it uniformly distributes points on the interval, and the forward differences of all points have the same statistical distribution. The paradox is that the forward difference of the highest-value point, the one that crosses the 0-1 discontinuity, is on average twice the others and it's distribution has a different shape.

The reason this forward difference has a different distribution is that it contains the value 0. Larger forward differences (gaps) are more likely to contain any fixed value, simply because they are larger.

We could search for the gap that contains 1/pi, for example, and it too would have the same atypical distribution.

APA
  • 61
  • 4