Q : "Is it possible to parallelize this program (Simpson's Rule) in Python?"
A more important side of the same coin is not how to parallelise a program ( or a part of it ), but what add-on cost penalties doing so will actually incur. From the several options on how to parallelise, the best one should be selected. Here, due to the large data/memory-footprints ( an adverse O( n )-scaling in SpaceDOMAIN : the sum of the memory-footprints of all objects used, incl. interim storage variables, easily reaches ~ 8 * 1E9 [B] per 1E9-sized float64 vector ), which next, indirectly, also increases the O( n )-scaling in TimeDOMAIN ( duration ), due to the memory-I/O volume and the bottle-necking of all the RAM-I/O channels available, a kind of fine-grain parallelism, called vectorisation, seems to fit best. It adds almost zero add-on costs, yet helps reduce the RAM-I/O costs for an extremely low-complexity f(x)-lambda, whose arithmetic is so cheap that memory pre-fetches of non-contiguous blocks ( caused by index-hopping ) are almost impossible to latency-mask.
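To make the SpaceDOMAIN arithmetic above tangible, here is a tiny sketch ( not the OP's code, just an assumed straight-line baseline with hypothetical f, a, b placeholders ), showing what each named, full-size interim object would weigh at the question's ~ 1E9 size :

import numpy as np

f = lambda x: x * x                    # hypothetical low-complexity integrand
a, b   = 0.0, 1.0                      # hypothetical integration bounds
N_TOY  = 10**6                         # kept small enough to run anywhere
N_FULL = 10**9                         # the size discussed in the question

x = np.linspace( a, b, N_TOY + 1 )     # named interim object No.1
y = f( x )                             # named interim object No.2

print( "each 1E9-sized float64 object ~ {0:.1f} [GB]".format( ( N_FULL + 1 ) * 8 / 1E9 ) )
print( "each object actually allocated here ~ {0:.1f} [MB]".format( x.nbytes / 1E6 ) )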
There are at least two places where a code-conversion will help, most of all on CPU micro-architectures whose native vector-instructions can get harnessed, in most recent numpy versions, for an indeed HPC-grade parallel execution of the numpy-vectorised code.
A mandatory performance-tuning disclaimer :
Detailed profiling will show you, for each particular target code-execution platform { x86 | ARM | ... } and its actual UMA / NUMA memory-I/O & CPU-core cache-hierarchy, where your code loses most of the time ( and how successfully or how poorly the actual processor cache-hierarchy masks the real costs of accessing such vast footprints of RAM, with all the adverse effects of the physical RAM memory-I/O, as 1E9-sized objects will never fit into CPU-core caches / registers ).
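As a zero-dependency illustration of such profiling ( dedicated tools like perf, VTune or line_profiler go much deeper ), a plain per-phase wall-clock timing, here on assumed placeholder f, a, b, N, may already localise the memory-bound part :

import numpy as np
from time import perf_counter

f = lambda x: x * x                      # hypothetical low-complexity integrand
a, b, N = 0.0, 1.0, 10**7                # modest size; scale N up only on a big-RAM box

t0 = perf_counter()
x  = np.linspace( a, b, N + 1 )          # phase 1: node generation   ( RAM writes )
t1 = perf_counter()
y  = f( x )                              # phase 2: lambda evaluation ( RAM reads + writes )
t2 = perf_counter()
S  = ( y[ 0] + y[-1]
     + 4 * np.sum( y[1:-1:2] )           # phase 3: strided ( index-hopping ) reads + sums
     + 2 * np.sum( y[2:-1:2] )
       ) * ( ( b - a ) / N ) / 3
t3 = perf_counter()

print( "linspace : {0:9.3f} [ms]".format( 1E3 * ( t1 - t0 ) ) )
print( "f( x )   : {0:9.3f} [ms]".format( 1E3 * ( t2 - t1 ) ) )
print( "sums     : {0:9.3f} [ms]  Simpson ~ {1:.6f}".format( 1E3 * ( t3 - t2 ), S ) )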
a )
we may spend less time on producing / storing interim objects and use the brave, smart-vectorised numpy-code trick :

y = f( np.linspace( a,     # lo-bound
                    b,     # hi-bound
                    N + 1  # number of nodes
                    )      # the HUGE ~ 8 [GB] node-vector gets consumed in-flight,
       )                   #     never kept as a named object x in RAM
                           # +ask the lambda to in-flight compute & create just y = f( x[:] )
If in doubt, feel free to read more about the costs of the various kinds of computing / storage-related access latencies.
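For completeness, a self-contained toy rendition of trick a) may read as follows ( f, a, b, N are again hypothetical placeholders, sized to run anywhere, whereas the question's N is ~ 1E9 ) :

import numpy as np

f = lambda x: np.sin( x )              # hypothetical low-complexity integrand
a, b, N = 0.0, np.pi, 10**6            # toy size

y = f( np.linspace( a,                 # lo-bound
                    b,                 # hi-bound
                    N + 1              # number of nodes
                    )                  # node-vector gets consumed in-flight by the lambda
       )
print( y.shape, y.dtype,               # y remains the only named, full-size object
       "~ {0:.1f} [MB] kept in RAM".format( y.nbytes / 1E6 ) )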
b )
we may reduce at least some parts of the repetitive memory-access patterns that ( as expressed above, for sizes of about ~1E9 elements, cannot remain in-cache and will have to be re-fetched from physical RAM ) still burden the computation :
# proper guarding of the code against a non-even N is omitted for clarity
#                        of the vectorised demo and left to the kind reader
#_________________________________________________________________ double steps
resultado = ( 4 * np.sum( f( np.linspace( a + dx,      # lo-bound
                                          b - dx,      # hi-bound
                                          N // 2       # odd-index inner nodes, 2*dx apart
                                          )
                             ) #--------- lambda on a smaller, RAM/cache-compact object
                          )    #--------- numpy controls the summation over contiguous RAM
            + 2 * np.sum( f( np.linspace( a + dx + dx, # lo-bound
                                          b - dx - dx, # hi-bound
                                          N // 2 - 1   # even-index inner nodes, 2*dx apart
                                          )
                             ) #--------- lambda on a smaller, RAM/cache-compact object
                          )    #--------- numpy controls the summation over contiguous RAM
            + f( a )
            + f( b )
              ) * dx / 3
While the mock-up code does not aspire to solve all corner cases, the core benefit comes from using RAM-contiguous memory-layouts, which get very efficiently numpy.sum()-ed, and from avoiding replicated visits to memory-areas that would be re-visited just due to an imperatively dictated ( non-contiguous ) index-jumping ( numpy can optimise some of its own indexing so as to maximise the memory-access pattern / cache-hits, yet the "outer", hand-coded index-jumping is almost always beyond the reach of such smart, but hard-wired, numpy-optimisation know-how ( the less of any silicon-based thinking or clairvoyance ;o) ).
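Finally, putting tricks a) and b) together, a self-contained ( hedged ) sketch, cross-checked against a closed-form integral, may look like the one below; the function name vectorised_simpson and the sin() test-integrand are illustrative choices only, not the OP's code :

import numpy as np

def vectorised_simpson( f, a, b, N ):                  # illustrative name; assumes an even N
    """Composite Simpson's Rule without keeping a full-size node-vector in RAM."""
    dx = ( b - a ) / N
    return ( 4 * np.sum( f( np.linspace( a + dx,       # odd-index inner nodes,
                                         b - dx,       #     2*dx apart,
                                         N // 2 ) ) )  #     N // 2 of them
           + 2 * np.sum( f( np.linspace( a + dx + dx,  # even-index inner nodes,
                                         b - dx - dx,  #     2*dx apart,
                                         N // 2 - 1 ) ) ) # N // 2 - 1 of them
           + f( a )
           + f( b )
             ) * dx / 3

if __name__ == "__main__":
    approx = vectorised_simpson( np.sin, 0.0, np.pi, 10**6 )
    print( approx, "~ 2.0 expected" )                  # integral of sin( x ) over [ 0, pi ] is 2

The two np.linspace() calls again produce RAM-contiguous, cache-friendly blocks, so numpy.sum() streams through them without any index-hopping.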