3

I need to calculate the expected value of the distance between 2 points uniformly distributed within a unit square. This problem can be solved analytically, and the answer is approximately 0.521405. What I'm trying to accomplish is to come up with this number using a random number generator and Python, with and without numpy. Using the code below

import math
import random

dist = list()
random.seed()
for each in range(100000000):
    x1 = random.random()
    x2 = random.random()
    y1 = random.random()
    y2 = random.random()
    dist.append(math.sqrt((x1 - x2)**2 + (y1 - y2)**2))
print(round(sum(dist) / float(len(dist)), 6))

It iterates 100 million times in 125 seconds on my machine, but only 4 decimal digits are correct. Then, with numpy, I created the following code

import time
import numpy as np

dist = list()
start_time = time.time()
np.random.seed()
for each in range(100000000):
    x = np.array((np.random.random(), np.random.random()))
    y = np.array((np.random.random(), np.random.random()))
    dist.append(np.linalg.norm(x - y))
print(round(np.mean(dist), 6))

and it took 1111 seconds to iterate the same 100 million times!

As the result was correct only up to 4 decimal digits, I tried to increase the number of iterations to 1 billion using the former version, without numpy. I figured that, since each float is at most 64 bits (I'm using 64-bit Python), the list would take roughly 8 GB. However, the program used up 26 GB of memory and errored out with an exception when the list had 790 million items.

So I'm seeking your advice on the following:

  1. Is there a way to make use of various numpy optimizations and actually make it work faster?
  2. Is there a way to make the program more memory-efficient? I realize that the list is a far more complex data structure than I need for my purpose.
  3. Am I correct in assuming that, in order to get 6 decimal digits right, I will need the number of iterations to be close to 10^12? (Because the standard error of N measurements decreases as 1/sqrt(N).) A rough check of this estimate is sketched right after this list.
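
The sketch below estimates the per-sample standard deviation from a small pilot run and then back-solves for the N that would bring the standard error down to roughly 5e-7 (half a unit in the 6th decimal); the pilot size and target error are arbitrary choices made only for illustration.

import math
import random

random.seed()

# Pilot run: estimate the mean and standard deviation of a single pair distance.
pilot = 1000000
s = 0.0
s2 = 0.0
for _ in range(pilot):
    dx = random.random() - random.random()
    dy = random.random() - random.random()
    d = math.sqrt(dx*dx + dy*dy)
    s += d
    s2 += d*d
mean = s / pilot
std = math.sqrt(s2 / pilot - mean*mean)  # comes out around 0.25

# With standard error std/sqrt(N), hitting a target of ~5e-7 needs
# N ~ (std / 5e-7)**2, which lands in the 10^11 to 10^12 range.
target_se = 5e-7
print(mean, std, (std / target_se) ** 2)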

Thanks in advance!

Nick Slavsky
  • You should make a single `x` and `y`, each with 10m values – look at the docs for `random`. Then try to find a way to do the distance calculation in one go (i.e. without a loop). – dan-man May 03 '16 at 09:00

4 Answers

4

To answer the first two sub-questions, here's an approach that generates all the random numbers in one go and then uses np.einsum to replace most of np.linalg.norm's work (squaring the differences and summing along the rows), keeping the rest of the code as it is. The implementation would look something like this:

import numpy as np

N = 100000  # Edit this to change number of random points
x, y = np.random.random((2, N, 2))
diffs = x - y
out = round(np.mean(np.sqrt(np.einsum('ij,ij->i', diffs, diffs))), 6)
Divakar
  • do you really need to use `einsum` here? wouldn't `hypot` be better and simpler? – dan-man May 03 '16 at 09:07
  • @dan-man From performance point of view, I have found out `einsum` to be really efficient, especially to replace squaring and here we are sum-reducing as well. So, my gut feeling went with it :) – Divakar May 03 '16 at 09:09
  • @Divakar - but here the sum is just over an axis of length 2, which `hypot` will do internally, and do the sqrt in the same pass. Also note that `hypot` rearranges the equation to get better precision. – dan-man May 03 '16 at 09:12
  • @dan-man Frankly, I haven't used it much. Post that as an answer maybe? – Divakar May 03 '16 at 09:13
2

As for the memory issue, your assumption that 'a float takes 64 bits' is not quite correct. Every float object (and it is, in fact, a boxed object in Python) takes 24 bytes. You can check this with sys.getsizeof(0.0). Thus, your program should require roughly 3 times more space than you estimated, which is approximately what you actually experienced.
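
A quick way to confirm this, assuming 64-bit CPython (per-object sizes are an implementation detail and may differ elsewhere):

import sys

# 24 bytes on 64-bit CPython: 8 for the double itself plus the object header.
print(sys.getsizeof(0.0))

# Rough estimate for the 1-billion-element attempt: the float objects alone
# cost about 1e9 * 24 bytes ~= 24 GB, before the list's own 8-byte-per-element
# pointer array and over-allocation are counted.
print(1000000000 * 24 / 1e9, "GB")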

1

I would suggest this variant without numpy, assuming you're using Python 3:

import math
import random

random.seed()
dist_count = 100000000
dist_sum = 0.0

for _ in range(dist_count):
    dx = random.random() - random.random()  # x1 - x2
    dy = random.random() - random.random()  # y1 - y2

    dist_sum += math.sqrt(dx*dx + dy*dy)

dist_average = dist_sum / dist_count
print(round(dist_average, 6))

First, why should we store all distances in a list if we're just going to count and sum them up? It's faster to add each random distance directly to a single running-total variable. Also, we already know how many random distances we created, because that's what we specify in our for-loop's range, so we don't need anything like len(dist) or a separate counter either.

Furthermore we don't have to assign each coordinate a name, we can just calculate the dx and dy differences on the fly. This also helps us in the next step.

Multiplying a value by itself is up to an order of magnitude faster than raising it to the power of 2 (further reading...). Therefore we replace a**2 with a*a. This is where storing the differences directly becomes useful.
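
If you want to measure that claim on your own machine, a quick timeit comparison along these lines works (exact ratios depend on the interpreter version, so treat the numbers as indicative only):

import timeit

# Square a float by multiplication vs. the ** operator, one million calls each.
print(timeit.timeit('a*a', setup='a = 0.7341'))
print(timeit.timeit('a**2', setup='a = 0.7341'))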

Finally, we divide the distance sum by the count and display the result once.

Byte Commander
  • That's really more efficient in terms of memory, and it **did not** crash on 1 billion iterations, always staying under 30 MB of memory! However, the 100 million iterations took this code **75 seconds** to execute, as opposed to the numpy example suggested by Divakar. I guess this is about as fast as we can get – Nick Slavsky May 03 '16 at 10:51
  • Yeah, I have no experience with numpy and don't know what optimized functions it has. They're probably written directly in fast C code without the Python overhead, so I can't compete with that. Just out of curiosity, can you tell me how the accepted numpy answer scored regarding memory consumption and scalability to even larger numbers? – Byte Commander May 03 '16 at 11:10
  • Well, since the numpy answer generates 2 arrays of size N and each float takes up 24 bytes as posted here, it does not scale at all and runs out of memory at N=500 million. And everything below that is calculated in less than 30 seconds :) I guess I made a mistake by asking 3 questions in one: now I have 2 answers, time- and memory-wise, and I'm kind of lost which one to accept. I will go with yours, because Divakar's has more votes, so people visiting this page will see yours as accepted – Nick Slavsky May 03 '16 at 11:21
  • @NickSlavsky Thank you very much, I'm glad I could help you. I think we can sum up that for moderate test series lengths, one should prefer numpy for its performance, but large series require the flexibility and low resource usage of my approach. – Byte Commander May 03 '16 at 11:51
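
To combine the two trade-offs discussed in the comments above, one option is a chunked variant of the numpy approach: it keeps peak memory bounded by the chunk size while still doing the arithmetic in vectorized blocks. This is only a sketch based on the ideas in this thread, and the chunk size below is an arbitrary choice.

import numpy as np

np.random.seed()

total = 1000000000   # total number of point pairs to sample
chunk = 10000000     # pairs per block; bounds peak memory (~320 MB for x and y here)
done = 0
dist_sum = 0.0

while done < total:
    n = min(chunk, total - done)
    x, y = np.random.random((2, n, 2))   # n random point pairs
    dist_sum += np.hypot(x[:, 0] - y[:, 0], x[:, 1] - y[:, 1]).sum()
    done += n

print(round(dist_sum / total, 6))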
1

Well, hypot is a bit faster on my laptop: 13.5 sec vs 15.2 sec for einsum. Memory-wise it likely won't work with 1 billion; the current version takes ~6 GB.

Code:

import numpy as np

np.random.seed(12345)

N = 100000000 # Edit this to change number of random points
x,y = np.random.random((2,N,2))
diffs = x - y

out = round(np.mean( np.hypot(diffs[:,0], diffs[:,1]) ),6)

#out = round(np.mean(np.sqrt(np.einsum('ij,ij->i',diffs,diffs))),6)

print(out)
Severin Pappadeux