Fast algorithm to calculate Pi in parallel

Question

I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.

I have already implemented the simple Monte Carlo method, which is easily parallelizable. Each thread randomly generates points on the unit square and counts how many lie within the unit circle, and the per-thread counts are tallied with a reduction operation.
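A minimal sketch of that scheme, written in Python rather than CUDA for brevity. Each call to the hypothetical `count_hits` plays the role of one thread, and the final `sum` stands in for the GPU reduction. Points are drawn from [0, 1) x [0, 1), so the hit ratio approximates the area of a quarter circle, pi/4. The per-worker seeding is an assumption made only to keep the example reproducible.

```python
import random

def count_hits(n_points, seed):
    """One 'thread': count random points in the unit square that fall inside the unit circle."""
    rng = random.Random(seed)  # private generator per worker
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def monte_carlo_pi(n_workers=8, points_per_worker=25_000):
    # Each worker gets its own seed so the random streams differ.
    tallies = [count_hits(points_per_worker, seed=w) for w in range(n_workers)]
    total_hits = sum(tallies)  # stands in for the parallel reduction
    return 4.0 * total_hits / (n_workers * points_per_worker)

print(monte_carlo_pi())
```

The error shrinks only like 1/sqrt(N), so each extra decimal digit costs roughly 100 times more samples, which is why this method is a poor fit for computing many digits.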

But that is certainly not the fastest algorithm for calculating the constant. When I did this exercise before on a single-threaded CPU, I used Machin-like formulae, which converge far faster. For those interested, this involves expressing pi as a sum of arctangents and evaluating each arctangent with its Taylor series.

An example of such a formula:

[Image not preserved: a Machin-like formula expressing pi as a combination of arctangents]
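The formula image is not reproduced here. As a stand-in, the sketch below uses Machin's classic identity, pi = 16 arctan(1/5) - 4 arctan(1/239); this is an assumption, and the original image may have shown a different Machin-like formula. It evaluates each arctangent with its Taylor series using Python's built-in big integers, scaled by a power of ten. The names `arctan_inverse` and `machin_pi` are illustrative.

```python
def arctan_inverse(x, scale):
    """Approximate arctan(1/x) * scale with the Taylor series, using integers only."""
    total = 0
    power = scale // x          # scale / x^(2k+1), starting at k = 0
    x_squared = x * x
    k = 0
    sign = 1
    while power:
        total += sign * (power // (2 * k + 1))
        power //= x_squared     # advance to the next odd power of 1/x
        sign = -sign
        k += 1
    return total

def machin_pi(digits, guard=10):
    """Return pi * 10**digits as an integer, truncated."""
    scale = 10 ** (digits + guard)  # guard digits absorb the truncation error
    pi_scaled = 16 * arctan_inverse(5, scale) - 4 * arctan_inverse(239, scale)
    return pi_scaled // 10 ** guard

print(machin_pi(30))
```

Every loop iteration divides a number as long as the full result, which illustrates the difficulty described in the question: the work is dominated by a few very long integers rather than many independent small values.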

Unfortunately, I found that parallelizing this technique across thousands of GPU threads is not easy. The problem is that most of the operations are high-precision arithmetic rather than floating-point operations on long vectors of data.

So I'm wondering: what is the most efficient way to calculate an arbitrarily large number of digits of pi on a GPU?

Have you looked at this: https://sites.google.com/a/nirmauni.ac.in/cudacodes/ongoing-projects/automatic-conversion-of-source-code-for-c-to-cuda-c/converted-programs/calculate-value-of-pi — James Black, Jun 05 '12 at 02:00
I don't think that one does arbitrary precision calculations. — tskuzzy, Jun 05 '12 at 02:05
@JamesBlack: the code you have linked to is utter nonsense. It seems to be an incredibly naive automatic translation of a serial piece of C code into a serial piece of GPU code where many threads compute the identical first 1000 elements of the series expansion. Literally 99.99% of the computation performed by the code is redundant. — talonmies, Jun 05 '12 at 08:23
Erlang? I think you could use it for parallel processing. Not sure if it helps with algorithm implementation. — Code Droid, Jul 17 '12 at 22:33
See also: http://stackoverflow.com/questions/19/fastest-way-to-get-value-of-pi and http://stackoverflow.com/questions/14283270/how-to-determine-whether-my-calculation-of-pi-is-accurate — assylias, May 16 '13 at 09:45

score 19 · Answer 1 · answered Jun 05 '12 at 02:22

You should use the Bailey–Borwein–Plouffe (BBP) formula.

Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.
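For reference, the BBP series is usually stated as follows, where each index k gives one independently computable term:

```latex
\pi = \sum_{k=0}^{\infty} \frac{1}{16^{k}}
      \left( \frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6} \right)
```

The factor 1/16^k is what ties the series to base 16: multiplying by 16^n shifts the hexadecimal point n places.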

Then, it is preferable that each processor manipulates low-precision values rather than very high-precision ones. For example, if you want one billion decimals and you use some of the expressions used here, like the Chudnovsky algorithm, each of your processors will need to manipulate a billion-digit number. That's simply not the appropriate method for a GPU.

So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read up on the "BBP digit-extraction algorithm for π".

Advantages of the BBP algorithm for computing π

This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types. The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.
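A minimal sketch of the digit-extraction idea, in Python rather than CUDA for brevity, assuming double-precision floats are accurate enough for the positions tested. The names `bbp_partial` and `pi_hex_digit` are illustrative and not from any library. Modular exponentiation keeps every intermediate value small, and the digits produced are hexadecimal, not decimal.

```python
def bbp_partial(j, n):
    """Fractional part of sum over k of 16**(n - k) / (8k + j)."""
    s = 0.0
    # Terms with k <= n: reduce 16**(n - k) modulo the denominator so values stay small.
    for k in range(n + 1):
        d = 8 * k + j
        s = (s + pow(16, n - k, d) / d) % 1.0
    # Terms with k > n: they shrink by a factor of 16 each step, so a few suffice.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        s += term
        k += 1
    return s % 1.0

def pi_hex_digit(n):
    """Hex digit of pi at position n after the point (n = 0 is the first digit)."""
    x = (4 * bbp_partial(1, n) - 2 * bbp_partial(4, n)
         - bbp_partial(5, n) - bbp_partial(6, n)) % 1.0
    return "0123456789abcdef"[int(x * 16)]

# Each position is computed independently, e.g. one GPU thread per digit.
print("".join(pi_hex_digit(n) for n in range(10)))
```

Because `pi_hex_digit(n)` shares no state between calls, the positions can be handed out to threads in any order. Float rounding limits how far this particular sketch can be trusted; very large n would need wider arithmetic.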

So I understand the idea that you compute all the digits you want in (embarrassingly) parallel. But that isn't a guarantee that this algorithm is *efficient*; each processor/GPU might be computing information that others could share. Maybe this algorithm is efficient and you just haven't told us how. But if not, you don't want to parallelize an inefficient algorithm just because you can. (Perhaps a more useful measure would be digits/transistor or digits/watt produced.) — Ira Baxter, Jun 05 '12 at 03:03
Well, it's a "decent" algorithm. It is not the best one (records are held by other algorithms) but it is still decent. And let's also remember that OP does not wish to break records, but `I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.` — B. Decoster, Jun 05 '12 at 03:43
Then it's a fine scheme to try out. (I've seen people trying to make parallel programs in Python, which is an interpreter. Eh what?) — Ira Baxter, Jun 05 '12 at 09:19
Keep in mind that BBP doesn't give you decimal digits, only binary. — mhum, Jun 05 '12 at 15:13
