
I have a numpy array of shape (1e6, 1) that I would like to take weighted samples from, weighting by the values so that the largest values are drawn most often. However, it is difficult to scale the array to sum to one because of the floating point precision required.

Here is an example I can create using random numbers (in my case the numbers are not random):

import numpy as np
A = np.random.rand(1000000)
probs = A / np.sum(A)
sample = np.random.choice(A, p=probs)
# fails because probs doesn't sum exactly to one
# (ValueError: probabilities do not sum to 1)
BadProgrammer
  • This is a problem related to floating point precision. If you do the same with a lower sample size (e.g. 1000 elements) you will get 1.0, but for such large sample sizes the sum can yield values slightly above or below 1 (e.g. 1.0000000000000002). Hence checking the sum for equality with 1.0 will return False. – N. P. Oct 11 '17 at 13:16
  • Possible duplicate of [Python rounding error with float numbers](https://stackoverflow.com/questions/5997027/python-rounding-error-with-float-numbers) – Reti43 Oct 11 '17 at 13:18
  • yeah, I see it is a floating point problem. But what is the best way to go about solving this so they sum to 1? I'm okay introducing a bit of error, perhaps there is a good way to round? – BadProgrammer Oct 11 '17 at 13:20
  • That does not address rounding a list of numbers so that their sum is a specific value. – BadProgrammer Oct 11 '17 at 13:24
  • Do it in two stages: first divide by the original sum, then distribute the error over whatever number of elements gives an acceptable 'bit of' error. – paddyg Oct 11 '17 at 13:38
  • Definitely not a duplicate of any of the linked questions above, in my opinion. This is a legitimate question about how to deal with a particular issue with floating point accuracy; the linked questions are general 'questions' of the form 'I don't understand floating point numbers'. – Eelco Hoogendoorn Oct 11 '17 at 13:48
  • @DavidG. No, this is definitely not a duplicate. I don't think you fully understood the problem. – Mad Physicist Oct 11 '17 at 13:49
  • How about setting the last element of `probs` *manually*: `probs[-1] = 1 - probs[:-1].sum()`? (A sketch of this follows these comments.) – Divakar Oct 11 '17 at 14:07
  • It's possible `probs[:-1]` would sum to larger than 1.0, but I'm currently kicking this idea around. – BadProgrammer Oct 11 '17 at 14:23
  • Any number of elements at the end could be off; using np.clip would be more robust – Eelco Hoogendoorn Oct 11 '17 at 14:42
  • @EelcoHoogendoorn could you post how you would do this? – BadProgrammer Oct 11 '17 at 14:52
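
For reference, a minimal sketch of the manual fix discussed in the comments above (set the last entry so the probabilities sum to 1, then clip as a guard). Whether the result always passes np.random.choice's internal tolerance check is an assumption, not a guarantee:

import numpy as np

A = np.random.rand(1000000)
probs = A / np.sum(A)

# Absorb the rounding error into the last element so the probabilities
# sum (almost) exactly to one...
probs[-1] = 1.0 - probs[:-1].sum()
# ...and clip in case that last element came out slightly negative.
np.clip(probs, 0.0, None, out=probs)

sample = np.random.choice(A, p=probs)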

1 Answer


I've written functions to do exactly the same thing in the past; sadly, I can't find them right now.

The crux was to first argsort the array: after sorting from smallest to largest, the summation is numerically stable. You can then build a cumulative distribution from the sorted array, sample uniformly from it, and map back to the original elements using the result of the argsort.

This completely eliminated any numerical stability issues for me.
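
For illustration, a minimal sketch of that approach; the helper name weighted_sample and the use of np.searchsorted for the inverse-CDF lookup are my own choices, not taken verbatim from the answer:

import numpy as np

def weighted_sample(values, weights, size=1):
    # Sort the weights ascending so the running sum accumulates the small
    # terms first, which keeps the cumulative distribution numerically stable.
    order = np.argsort(weights)
    cdf = np.cumsum(weights[order])
    # Sample uniformly on [0, total weight) and invert the CDF; there is no
    # need to normalize the weights to sum exactly to 1.
    u = np.random.uniform(0.0, cdf[-1], size=size)
    idx = np.searchsorted(cdf, u, side='right')
    idx = np.minimum(idx, len(cdf) - 1)  # guard against u landing on the total
    # Map the sorted positions back to the original elements.
    return values[order[idx]]

A = np.random.rand(1000000)
samples = weighted_sample(A, A, size=5)

If you draw many samples from the same weights, the argsort and cumsum can be computed once and cached, as mentioned in the comments below.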

Eelco Hoogendoorn
  • For me, I think speed would be more important than accuracy. Would you go about it the same way even if you were willing to sacrifice some accuracy for speed? – BadProgrammer Oct 11 '17 at 13:46
  • Yeah, it is a bit expensive; but you can cache the sorting if you sample often relative to the number of elements in the distribution. I recall that numpy has had improvements in the numerical stability of its default summing algorithm; I think cumsum uses something like hierarchical summation nowadays, and may actually be very stable without sorting? But there are restrictions to that, like only working over contiguous axes or some such... I'll see if I can find a link. – Eelco Hoogendoorn Oct 11 '17 at 13:51
  • Perhaps using an unsorted cumsum approach with np.cumsum(r, dtype=np.float128) would give you an improvement over np.random.choice in accuracy without being very different in performance? Haven't checked what np.random.choice does internally, however. – Eelco Hoogendoorn Oct 11 '17 at 14:00
  • random.choice is used for sampling; cumsum is used for normalizing. I'm considering just taking the difference between 1.0 and the sum, then subtracting it from a random element (provided it is greater than epsilon). – BadProgrammer Oct 11 '17 at 14:17