Optimize estimating Pi function C++

Question

I've wrote a program, that approximates Pi using the Monte-Carlo method. It is working fine, but I wonder if I can make it work better and faster, because when inserting something like ~n = 100000000 and bigger from that point, it takes some time to do the calculations and print the result.

I've imagined how I could try to approximate it better by doing a median for n results, but considering how slow my algorithm is for big numbers, I decided not doing so.

Basically, the question is: How can I make this function work faster?

Here is the code that I've got so far:

double estimate_pi(double n)
{
    int i;
    double x,y, distance;
    srand( (unsigned)time(0));
    double num_point_circle = 0;
    double num_point_total = 0;
    double final;

    for (i=0; i<n; i++)
    {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        distance = sqrt(x*x + y*y);
        if (distance <= 1)
        {
            num_point_circle+=1;
        }
        num_point_total+=1;
    }
    final = ((4 * num_point_circle) / num_point_total);
    return final;
}

note that there are much faster converging methods to estimate Pi. Using monte carlo to estimate pi is a nice exercise, but not much more than that. Anyhow, of course you can use it as an exercise to make it faster. Instead of a quarter circle you can try to use a smaller slice — 463035818_is_not_an_ai, Nov 26 '21 at 11:28
Very minor improvements.... a) num_point_total = n so no need to calculate it in the loop. b) Use ++ rather than +=1 (although maybe this is optimised by the compiler). c) use int for num_point_circle - be surprised if you see much change though with n= 1000 by these changes - what is "some time" on your machine? — Roger Cigol, Nov 26 '21 at 11:46
What do you hope to gain by making the method faster ? In any case you will *never* reach, say, ten digits of accuracy, and you already know that the method works. So what ? — , Nov 26 '21 at 13:15
Try OpenMP to parallel computations. If you do so, consider using something different that `rand()` which is not threadsafe. You may also try to replace `rand()` with something different as it is quite slow. — Michał Walenciak, Nov 26 '21 at 13:53
@MichałWalenciak: as I said in my answer, making the algorithm faster is virtually hopeless. Parallelizing on, say 64 cores will at best yield a speedup of 8, not even enough for a new exact digit. — , Nov 26 '21 at 14:49

score 3 · Accepted Answer · answered Nov 26 '21 at 11:52

An obvious (small) speedup is to get rid of the square root operation. sqrt(x*x + y*y) is exactly smaller than 1, when x*x + y*y is also smaller than 1. So you can just write:

double distance2 = x*x + y*y;
if (distance2 <= 1) {
    ...
}

That might gain something, as computing the square root is an expensive operation, and takes much more time (~4-20 times slower than one addition, depending on CPU, architecture, optimization level, ...).

You should also use integer values for variables where you count, like num_point_circle, n and num_point_total. Use int, long or long long for them.

The Monte Carlo algorithm is also embarrassingly parallel. So an idea to speed it up would be to run the algorithm with multiple threads, each one processes a fraction of n, and then at the end sum up the number of points that are inside the circle.

Additionally you could try to improve the speed with SIMD instructions.

And as last, there are much more efficient ways of computing Pi. With the Monte Carlo method you can do millions of iterations, and only receive an accuracy of a couple of digits. With better algorithms its possible to compute thousands, millions or billions of digits.

E.g. you could read up on the on the website https://www.i4cy.com/pi/

With just the simple Machin formula, 16 arctan(1/5)-4 arctan(1/239), you already reach double-precision accuracy with like 30 terms. — , Nov 26 '21 at 14:54

463035818_is_not_an_ai · Answer 2 · 2021-11-26T12:01:13.717

First of all, you should modify the code to consider how much the estimate changes after a certain number of iterations and stop when it reached a satisfiying accuracy.

For your code as is, with a fixed number of iterations, there isnt much to optimize that the compiler cannot do better than you. One minor thing you can improve is to drop computing the square root in every iteration. You don't need it, because sqrt(x) <= 1 is true if x <= 1.

Very roughly speaking you can expect a good estimate from your Monte Carlo method once your x,y points are uniformly distributed in the quarter circle. Considering that, there is a much simpler way to get points uniformly distributed without randomness: Use a uniform grid and count how many points are inside the circle and how many are outside.

I would expect this to converge much better (when decreasing the grid spacing) than Monte Carlo. However, you can also try to use it to speed up your Monte Carlo code by starting from counts of num_point_circle and num_point_total obtained from points on a uniform grid and then incrementing the counters by continuing with randomly distributed points.

score 2 · Answer 3 · answered Nov 26 '21 at 13:12

As the error decreases as the inverse of the square root of the number of samples, to gain a single digit you need one hundred times more samples. No micro-optimization will help significantly. Just consider that this method is of no use to effectively compute π.

If you insist: https://en.wikipedia.org/wiki/Variance_reduction.

Ol Sen · Answer 4 · 2021-11-26T15:56:02.817

How can I make this function work faster? not more precise!

when Monte Carlo is an estimate anyway and the final multiplication is a multiple of 2,4,8 and so on you can also do bit operations. Any if statement makes it slow, so trying to get rid of it. But when we increase 1 or nothing (0) anyway you can get rid of the statement and reduce it to simple math, which should be faster. When i is initialised before the loop, and we are counting up inside the loop, it can also be a while loop.

#include <time.h>
double estimate_alt_pi(double n){
    uint64_t num_point_total = 0;
    double x, y;
    srand( (unsigned)time(0));
    uint64_t num_point_circle = 0;
    while (num_point_total<n) {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        num_point_circle += (sqrt(x*x + y*y) <= 1);
        num_point_total++; // +1.0 we don't need 'i'
    }
    return ((num_point_circle << 2) / (double)num_point_total); // x<<2 == x * 4
}

Benchmarked with Xcode on a mac, looks like.

extern uint64_t dispatch_benchmark(size_t count, void (^block)(void));

int main(int argc, const char * argv[]) {
    size_t count = 1000000;
    double input = 1222.52764523423;
    __block double resultA;
    __block double resultB;
    uint64_t t = dispatch_benchmark(count,^{
        resultA = estimate_pi(input);
    });
    uint64_t t2 = dispatch_benchmark(count,^{
        resultB = estimate_alt_pi(input);
    });
    fprintf(stderr,"estimate_____pi=%f time used=%llu\n",resultA, t);
    fprintf(stderr,"estimate_alt_pi=%f time used=%llu\n",resultB, t2);
    return 0;
}

~1,35 times faster, or takes ~73% of the original time
but significant less difference when the given number is lower. And also the whole algorithm works properly only up to inputs with maximal 7 digits which is because of the used data types. Below 2 digits it is even slower. So the whole Monte Carlo algo is indeed not really worth digging deeper, when it's only about speed while keeping some kind of reliability.

Literally nothing will be faster than using a #define or static with a fixed number as Pi, but that was not your question.

score 1 · Answer 5 · answered Nov 26 '21 at 14:18

A first level improvement would be to sample on a regular grid, using brute force algorithm, as suggested by 463035818_is_not_a_number.

The next level improvement would be to "draw" just a semi circle for each 'x', counting how many points must be below y = sqrt(radius*radius - x*x). This reduces the complexity from O(N^2) to O(N).

With the grid size == radius of 10, 100, 1000, 10000, etc, one should have about one digit of each step.

One of the problems with the rand() function being, that the numbers soon begin to repeat -- with regular grid and this O(N) turned problem we can even simulate a grid of the size of 2^32 in quite a reasonable time.

With ideas from Bresenham's Circle drawing algorithm we can even quickly evaluate if the candidate (x+1, y_previous) or (x+1, y_previous-1) should be selected, as only one of those is inside the circle for the first octant. For second some other hacks are needed to avoid sqrt.

Spektre · Answer 6 · 2021-11-28T14:14:28.727

Your code is very bad from performance aspect as:

x = (double)rand()/RAND_MAX;
y = (double)rand()/RAND_MAX;

is converting between int and double, also using integer division ... Why not use Random() instead? Also this:

for (i=0; i<n; i++)

is a bad idea as n is double and i is int so either store n to int variable at start or change the header to int n. Btw why are you computing num_point_total when you already got n ? Also:

num_point_circle += (sqrt(x*x + y*y) <= 1);

is a bad idea why sqrt? you know 1^2 = 1 so you can simply do:

num_point_circle += (x*x + y*y <= 1);

Why not do continuous computing? ... so what you need to implement is:

load of state at app start
computation either in timer or OnIdle event

so in each iteration/even you will do N iterations of Pi (adding to some global sum and count)
save of state at app exit

Beware Monte Carlo Pi computation converges very slowly and you will hit floating point accuracy problems once the sum grows too big

Here small example I did a long time ago doing continuous Monte Carlo...

Form cpp:

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
#include "performance.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
int i=0,n=0,n0=0;
//---------------------------------------------------------------------------
void __fastcall TForm1::Idleloop(TObject *Sender, bool &Done)
    {
    int j;
    double x,y;
    for (j=0;j<10000;j++,n++)
        {
        x=Random(); x*=x;
        y=Random(); y*=y;
        if (x+y<=1.0) i++;
        }
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    tbeg();
    Randomize();
    Application->OnIdle = Idleloop;
    }
//-------------------------------------------------------------------------
void __fastcall TForm1::Timer1Timer(TObject *Sender)
    {
    double dt;
    AnsiString txt;
    txt ="ref = 3.1415926535897932384626433832795\r\n";
    if (n) txt+=AnsiString().sprintf("Pi  = %.20lf\r\n",4.0*double(i)/double(n));
    txt+=AnsiString().sprintf("i/n = %i / %i\r\n",i,n);
    dt=tend();
    if (dt>1e-100) txt+=AnsiString().sprintf("IPS = %8.0lf\r\n",double(n-n0)*1000.0/dt);
    tbeg(); n0=n;
    mm_log->Text=txt;
    }
//---------------------------------------------------------------------------

Form h:

//$$---- Form HDR ----
//---------------------------------------------------------------------------

#ifndef Unit1H
#define Unit1H
//---------------------------------------------------------------------------
#include <Classes.hpp>
#include <Controls.hpp>
#include <StdCtrls.hpp>
#include <Forms.hpp>
#include <ExtCtrls.hpp>
//---------------------------------------------------------------------------
class TForm1 : public TForm
{
__published:    // IDE-managed Components
    TMemo *mm_log;
    TTimer *Timer1;
    void __fastcall Timer1Timer(TObject *Sender);
private:    // User declarations
public:     // User declarations
    __fastcall TForm1(TComponent* Owner);
void __fastcall TForm1::Idleloop(TObject *Sender, bool &Done);
};
//---------------------------------------------------------------------------
extern PACKAGE TForm1 *Form1;
//---------------------------------------------------------------------------
#endif

Form dfm:

object Form1: TForm1
  Left = 0
  Top = 0
  Caption = 'Project Euler'
  ClientHeight = 362
  ClientWidth = 619
  Color = clBtnFace
  Font.Charset = OEM_CHARSET
  Font.Color = clWindowText
  Font.Height = 14
  Font.Name = 'System'
  Font.Pitch = fpFixed
  Font.Style = [fsBold]
  OldCreateOrder = False
  PixelsPerInch = 96
  TextHeight = 14
  object mm_log: TMemo
    Left = 0
    Top = 0
    Width = 619
    Height = 362
    Align = alClient
    ScrollBars = ssBoth
    TabOrder = 0
  end
  object Timer1: TTimer
    Interval = 100
    OnTimer = Timer1Timer
    Left = 12
    Top = 10
  end
end

So you should add the save/load of state ...

As mentioned there are much much better ways of obtaining Pi like BBP

Also the code above use my time measurement heder so here it is:

//---------------------------------------------------------------------------
//--- Performance counter time measurement: 2.01 ----------------------------
//---------------------------------------------------------------------------
#ifndef _performance_h
#define _performance_h
//---------------------------------------------------------------------------
const int   performance_max=64;                 // push urovni
double      performance_Tms=-1.0,               // perioda citaca [ms]
            performance_tms=0.0,                // zmerany cas po tend [ms]
            performance_t0[performance_max];    // zmerane start casy [ms]
int         performance_ix=-1;                  // index aktualneho casu
//---------------------------------------------------------------------------
void tbeg(double *t0=NULL)  // mesure start time
    {
    double t;
    LARGE_INTEGER i;
    if (performance_Tms<=0.0)
        {
        for (int j=0;j<performance_max;j++) performance_t0[j]=0.0;
        QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart);
        }
    QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
    if (t0) { t0[0]=t; return; }
    performance_ix++;
    if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t;
    }
//---------------------------------------------------------------------------
void tpause(double *t0=NULL)    // stop counting time between tbeg()..tend() calls
    {
    double t;
    LARGE_INTEGER i;
    QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
    if (t0) { t0[0]=t-t0[0]; return; }
    if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t-performance_t0[performance_ix];
    }
//---------------------------------------------------------------------------
void tresume(double *t0=NULL)   // resume counting time between tbeg()..tend() calls
    {
    double t;
    LARGE_INTEGER i;
    QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
    if (t0) { t0[0]=t-t0[0]; return; }
    if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t-performance_t0[performance_ix];
    }
//---------------------------------------------------------------------------
double tend(double *t0=NULL)    // return duration [ms] between matching tbeg()..tend() calls
    {
    double t;
    LARGE_INTEGER i;
    QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
    if (t0) { t-=t0[0]; performance_tms=t; return t; }
    if ((performance_ix>=0)&&(performance_ix<performance_max)) t-=performance_t0[performance_ix]; else t=0.0;
    performance_ix--;
    performance_tms=t;
    return t;
    }
//---------------------------------------------------------------------------
double tper(double *t0=NULL)    // return duration [ms] between tper() calls
    {
    double t,tt;
    LARGE_INTEGER i;
    if (performance_Tms<=0.0)
        {
        for (int j=0;j<performance_max;j++) performance_t0[j]=0.0;
        QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart);
        }
    QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
    if (t0) { tt=t-t0[0]; t0[0]=t; performance_tms=tt; return tt; }
    performance_ix++;
    if ((performance_ix>=0)&&(performance_ix<performance_max))
        {
        tt=t-performance_t0[performance_ix];
        performance_t0[performance_ix]=t;
        }
    else { t=0.0; tt=0.0; };
    performance_ix--;
    performance_tms=tt;
    return tt;
    }
//---------------------------------------------------------------------------
AnsiString tstr()
    {
    AnsiString s;
    s=s.sprintf("%8.3lf",performance_tms); while (s.Length()<8) s=" "+s; s="["+s+" ms]";
    return s;
    }
//---------------------------------------------------------------------------
AnsiString tstr(int N)
    {
    AnsiString s;
    s=s.sprintf("%8.3lf",performance_tms/double(N)); while (s.Length()<8) s=" "+s; s="["+s+" ms]";
    return s;
    }
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
#endif
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------

And here result after few seconds:

result

where IPS is iterations per second, i,n are the global variables holding actual state of computation, Pi is actual approximation and ref is refernce Pi value for comparison. As I computing in OnIdle and showing up in OnTimer there is no need for any locks as its all single thread. For more speed you could launch more computing threads however then you need to add multi threading lock mechanisms and have the global state as volatile.

In case you are doing console app (no forms) you still can do this just convert your code to infinite loop, recompute and output Pi value once in ??? iterations and stop on some key hit like escape.

Optimize estimating Pi function C++

6 Answers6

Linked