Sorry, this may be a hard read, but reality is cruel: many enthusiasts can easily sink man*months of "coding" efforts into a principally a-priori lost war. Better to carefully re-assess all the a-priori known PROS / CONS of any re-engineering plans, before blindly spending a single man*day in a principally wrong direction.
Most probably I would not post this here, had I not been exposed to a Project, where top-level academicians spent dozens of man*years ( yes, more than a year with a team of 12+ ) on "producing" a processing that took ~ 26 [hr], which was reproducible in less than ~ 15 [min] ( and way cheaper in HPC/GPU infrastructure costs ), if designed using proper ( hardware-performance non-devastating ) design methods...
"It doesn't seem like I can proceed any differently."
Well, that is actually pretty hard to tell, if not impossible :
Your post seems to assume a few cardinal things that may pretty much prevent getting any real benefit from moving the above-sketched idea onto an indeed professional HPC / GPU infrastructure.
"Possible solution: parallel programming"
Way easier to say it / type it, than to actually do it.
A-wish-to-run-in-true-[PARALLEL] process-scheduling remains just a wish and ( believe me, or Gene Amdahl, or other C/S veterans, or not ) an indeed hard process re-design is required, if your code is to get remarkably better than in the pure-[SERIAL] code-execution flow ( as posted above ).
1 ) the pure-[SERIAL] nature of the fileIO can ( almost ) kill the game :

the not-posted part about the pure-[SERIAL] file-accesses ( two files with data-points )... any fileIO is by nature the most expensive resource and, short of some smart re-engineering, it remains a pure-[SERIAL] (re-)reading in a sequential manner ( at best a one-stop cost, but still serial ), so do not expect any Giant-Leap anywhere far from this in the re-engineered code. This will always be the slowest and always an expensive phase.
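For the sake of illustration, a minimal one-stop-cost sketch of such a fileIO phase, assuming the two inputs are plain CSV files ( the file names, the skiprows and the column indices are illustrative assumptions only, adjust these to your real data layout ) :

    import numpy as np

    LAT_AQ,  LON_AQ  = 1, 2        # ASSUMPTION: adjust to your real column layout
    LAT_MEO, LON_MEO = 1, 2        # ASSUMPTION: adjust to your real column layout

    # Pay the ( expensive ) fileIO price exactly ONCE, land both datasets
    # in contiguous, RAM-resident numpy arrays and never re-touch the
    # filesystem again during the computing phase:
    aq  = np.loadtxt( "aq_data.csv",  delimiter = ",", skiprows = 1 )  # one pass, pure-[SERIAL]
    meo = np.loadtxt( "meo_data.csv", delimiter = ",", skiprows = 1 )  # one pass, pure-[SERIAL]

    latitude_aq,  longitude_aq  = aq[:,  LAT_AQ],  aq[:,  LON_AQ]      # zero-copy column views
    latitude_meo, longitude_meo = meo[:, LAT_MEO], meo[:, LON_MEO]     # zero-copy column views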
BONUS:
While this may seem the least sexy item in the inventory list of parallel-computing, pycuda, distributed-computing, hpc, parallelism-amdahl or whatever slang comes next, the rudimentary truth is that for making HPC-computing indeed fast and resource-efficient, both the inputs ( yes, the static files ) and the computing strategy are typically optimised for stream-processing and for enjoying a non-broken ( collision-avoided ) data-locality, if peak performance is to be achieved. Any inefficiency in these two domains can not just add to, but actually FACTOR the computing expenses ( so DIVIDE the performance ), and the differences may easily grow into several orders of magnitude ( from [ns] -> [us] -> [ms] -> [s] -> [min] -> [hr] -> [day] -> [week], you name them all... )
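As a hedged illustration of that input-side optimisation: convert the ( slow, text-parsed ) CSV just once into a binary .npy copy, pre-sorted along the join-key, so that every later run streams it at raw disk bandwidth, with a non-broken data-locality ( the file name and the TIME_MEO column index are assumptions carried over from the code below ) :

    import numpy as np

    # ONE-OFF pre-processing ( pay the text-parsing price exactly once ):
    meo = np.loadtxt( "meo_data.csv", delimiter = ",", skiprows = 1 )
    meo = meo[ np.argsort( meo[:, TIME_MEO], kind = "stable" ) ]  # sorted by the join-key
    np.save( "meo_data.npy", meo )                                # binary, locality-friendly copy

    # EVERY LATER RUN ( no parsing at all, lazy page-by-page streaming ):
    meo = np.load( "meo_data.npy", mmap_mode = "r" )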
2 ) the cost / benefit balance may get you to PAY-WAY-MORE-THAN-YOU-GET

This part is indeed your worst enemy : if the lump sum of your efforts is higher than the sum of the net benefits, the GPU will not add any value at all, or not enough to cover your add-on costs.
Why?
GPU engines are SIMD devices, great at using latency-masking over a vast area running repetitively the very same block of SMX-instructions, which needs a certain "weight"-of-"nice"-mathematics to happen locally, if they are to show any processing speedup over other problem-implementation strategies. GPU devices ( not the gamers' ones, but the HPC ones, which not all cards in the class are, are they? ) deliver best for indeed small areas of data-locality ( micro-kernel matrix operations, having a very dense, best a very small SMX-local "RAM" footprint of such a dense kernel << ~ 100 [kB] as of 2018/Q2 ).
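To see why a branch hurts a SIMD device: under divergence the hardware in effect executes both sides of the if for the whole SIMD-width and masks out the losers, so both paths get paid for. A branch-free formulation ( a generic, hedged illustration, not your code ) makes that cost-model explicit :

    import numpy as np

    x = np.random.rand( 1_000_000 )

    # branchy, scalar thinking ( a SIMD-killer, if transplanted as-is ):
    #     y[i] = x[i] * 2.0 if x[i] > 0.5 else 0.0
    #
    # branch-free, SIMD-friendly thinking: compute for ALL lanes,
    # then select by a mask ( both paths are paid for anyway ):
    mask = x > 0.5
    y    = np.where( mask, x * 2.0, 0.0 )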
Your "computing"-part of the code has ZERO-RE-USE of any single data-element that was ( rather expensively ) fetched from an original static storage, so almost all the benefits, that the GPU / SMX / SIMD artillery has been invented for is NOT USED AT ALL and you receive a NEGATIVE net-benefit from trying to load that sort of code onto such a heterogeneous ( NUMA complicated ) distributed computing ( yes, each GPU-device is a rather "far", "expensive" ( unless your code will harness it's SMX-resources up until almost smoke comes out of the GPU-silicon ... ) and "remote" asynchronously operated distributed-computing node, inside your global computing strategy ) system.
The very first branching of the GPU code is devastatingly expensive in its SIMD-execution costs, so your heavily if-ed code, while syntactically fair, is performance-wise an almost-killer of the game:
    for i in range( 0,
                    len( longitude_aq )
                    ):                                    #______________________________________ITERATOR #1 ( SEQ-of-I-s )
        currentAq = aq[i, :]                              # .SET
        center    = Coordinates( latitude_aq[i],          # .SET
                                 longitude_aq[i]
                                 )                        #           |
                                                          #           +-------------> # EASY2VECTORISE in [i]
        for j in range( 0,
                        len( longitude_meo )
                        ):                                #- - - - - - - - - - - - - - ITERATOR #2 ( SEQ-of-J-s )
            currentMeo = meo[j, :]                        # .SET
            grid_point = Coordinates( latitude_meo[j],    # .SET
                                      longitude_meo[j]
                                      )                   #           |
                                                          #           +-------> # EASY2VECTORISE in [j]
            if is_in_circle( center,
                             RADIUS,
                             grid_point
                             ):                           # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #1
                if ( currentAq[TIME_AQ]
                  == currentMeo[TIME_MEO]
                     ):                                   # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #2
                    humidity       += currentMeo[      HUMIDITY_MEO]  # BEST PERF.
                    pressure       += currentMeo[      PRESSURE_MEO]  #      IF SMART
                    temperature    += currentMeo[   TEMPERATURE_MEO]  #      CURATED
                    wind_speed     += currentMeo[    WIND_SPEED_MEO]  #      AS NON
                    wind_direction += currentMeo[WIND_DIRECTION_MEO]  #      ATOMICS
                    count          += 1.0
        if count != 0.0:                                  # !!!!!!!!!!!!!!!!!! THIS NEVER HAPPENS
            # EXCEPT WHEN ZERO DATA-POINTS WERE AVAILABLE FOR THE i-TH ZONE,
            #        FILE DID NOT CONTAIN ANY SUCH,
            #        WHICH IS FAIR,
            #        BUT SUCH A BLOCK OUGHT NEVER HAVE STARTED ANY COMPUTING AT ALL
            #        IF ASPIRING FOR INDEED BEING LOADED
            #        ONTO AN HPC-GRADE COMPUTING INFRASTRUCTURE ( SPONSORED OR NOT )
            #
            final_tmp[i, HUMIDITY_FINAL]       = humidity       / count
            final_tmp[i, PRESSURE_FINAL]       = pressure       / count
            final_tmp[i, TEMPERATURE_FINAL]    = temperature    / count
            final_tmp[i, WIND_SPEED_FINAL]     = wind_speed     / count
            final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction / count
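For the sake of completeness, a hedged numpy-vectorised sketch of the very same computing, with both IF-ed SIMD-KILLERs turned into boolean masks. It assumes is_in_circle() tests a great-circle distance with RADIUS in [km] ( approximated below by a haversine helper, swap in your own predicate, if it differs ) and re-uses your column-index constants :

    import numpy as np

    def haversine_km( lat1, lon1, lat2, lon2 ):    # ASSUMPTION: stands in for is_in_circle()
        """Great-circle distance [km]; broadcasts over numpy arrays."""
        lat1, lon1, lat2, lon2 = map( np.radians, ( lat1, lon1, lat2, lon2 ) )
        a = ( np.sin( ( lat2 - lat1 ) / 2 )**2
            + np.cos( lat1 ) * np.cos( lat2 ) * np.sin( ( lon2 - lon1 ) / 2 )**2
              )
        return 2 * 6371.0 * np.arcsin( np.sqrt( a ) )

    # all-[i,j] masks at once, no if-s, no python-level loops
    # ( NOTE: this costs O( I*J ) RAM; chunk over i, if I*J grows too large ):
    in_circle = haversine_km( latitude_aq[:, None],  longitude_aq[:, None],
                              latitude_meo[None, :], longitude_meo[None, :]
                              ) <= RADIUS                           # ~ IF-ed SIMD-KILLER #1
    same_time = ( aq[:, TIME_AQ][:, None]
               == meo[:, TIME_MEO][None, :]
                  )                                                 # ~ IF-ed SIMD-KILLER #2
    hitf  = ( in_circle & same_time ).astype( np.float64 )          # [i,j] 0./1. hit-matrix
    count = hitf.sum( axis = 1 )                                    # per-i data-point counts
    valid = count > 0                                               # guard the divisions

    for COL_MEO, COL_FINAL in ( (       HUMIDITY_MEO,       HUMIDITY_FINAL ),
                                (       PRESSURE_MEO,       PRESSURE_FINAL ),
                                (    TEMPERATURE_MEO,    TEMPERATURE_FINAL ),
                                (     WIND_SPEED_MEO,     WIND_SPEED_FINAL ),
                                ( WIND_DIRECTION_MEO, WIND_DIRECTION_FINAL ),
                                ):
        sums = hitf @ meo[:, COL_MEO]                               # masked per-i sums
        final_tmp[valid, COL_FINAL] = sums[valid] / count[valid]    # means, only where data exist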
If we omit both the iterators over the domain of all [i,j]-s and the if-ed crossroads, the actual "useful" part of the computing performs a very shallow amount of mathematics: the job contains a few SLOCs, in which principally independent values are summed ( best having avoided any collision of the adding operations, so each can be operated very cheaply, independently of the others, best with well-ahead pre-fetched constants ) in less than a few [ns].

YES, your computing payload does not require anything more than just a few units of [ns] to execute.
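If in doubt, a quantitative test will tell the facts ( a minimal sketch, using just the standard timeit plus numpy; the array shape is an illustrative assumption and the numbers will differ per platform ) :

    import numpy as np, timeit

    N = 10_000_000
    v = np.random.rand( N, 5 )                      # five "meo"-like columns

    # the naked payload: five streaming ADD-s per data-point,
    # operands already contiguous and CPU-close:
    t = timeit.timeit( lambda: v.sum( axis = 0 ), number = 10 ) / 10
    print( t / N * 1e9, "[ns] per data-point, all five ADD-s included" )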
The problem is in the smart-engineering of the data-flow ( I like to call that DATA-HYDRAULICS: how to get a further-incompressible flow of DATA into the { CPU | GPU | APU | *** }-processor registers, so as to get 'em processed ).
All the rest is easy. A smart solution of the HPC-grade DATA-HYDRAULICS typically is not.
No language, no framework will help you in this automatically. Some can relieve you of a part of the solution-engineering "manual" efforts, some cannot, and some can even spoil the achievable computing performance, due to "cheap" shortcuts in their internal design decisions and compromises made, which do not serve the same target you have: The Performance.
The Best next step?
A ) Try to better understand the limits of the computing infrastructures you expect to use for your extensive ( but not intensive, yes, just a few SLOCs per [i,j] ) workload, which HPC-supervisors do not like to see flowing onto their operated, expensive HPC-resources.
B ) If in trouble with the time + headcount + financial resources needed to re-engineer a top-down DATA-HYDRAULICS solution, best re-factor your code so as to get at least into a vectorised, numpy / numba style of processing ( numba will not always get remarkably farther than an already smart numpy-vectorised code, but a quantitative test will tell the facts per incident, not in general ). A vectorised sketch of your double loop was already shown above, right after your as-is code.
C ) If your computing-problem is expected to get re-run more often, definitely assess a re-designed pipeline, starting from the early pre-processing of the data-storage ( the slowest part of the processing ), where a stream-based pre-processing of the principally static values is possible and could impact the resulting DATA-HYDRAULICS flow ( performance ) the most, with pre-computed + smart-aligned values. The block of a few ADD-s down the lane will not get improved beyond a few [ns], as reported above, but the slow-flow can jump orders of magnitude faster, if re-arranged into a smarter flow, harnessing all available, yet "just"-[CONCURRENT]-ly operated resources ( any attempt to arrange a True-[PARALLEL] scheduling is pure nonsense here, as the task is principally by no means a [PARALLEL]-scheduling problem, but a stream of pure-[SERIAL] (re-)processing of data-points, where a smart, yet "just"-[CONCURRENT] processing re-arrangement may help scale-down the resulting duration of the process ). A sketch of such a pre-computed pipeline follows right below.
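A hedged sketch of such a C )-style pipeline: a one-off pre-processing pass pre-computes the static geometry ( which meo grid-points fall within RADIUS of which aq center ) with scipy's cKDTree and stores it, so every re-run skips the O( I*J ) distance work altogether ( the file name is an illustrative assumption, RADIUS is assumed to be in [km] ) :

    import numpy as np
    from scipy.spatial import cKDTree

    def to_xyz( lat_deg, lon_deg ):
        """Map lat/lon [deg] onto the unit sphere, so Euclidean ~ great-circle."""
        lat, lon = np.radians( lat_deg ), np.radians( lon_deg )
        return np.column_stack( ( np.cos( lat ) * np.cos( lon ),
                                  np.cos( lat ) * np.sin( lon ),
                                  np.sin( lat ) ) )

    EARTH_R_KM = 6371.0
    chord      = 2.0 * np.sin( RADIUS / ( 2.0 * EARTH_R_KM ) )     # great-circle radius -> chord

    # ONE-OFF pre-processing ( the static geometry never changes between re-runs ):
    tree      = cKDTree( to_xyz( latitude_meo, longitude_meo ) )
    neighbors = tree.query_ball_point( to_xyz( latitude_aq, longitude_aq ), r = chord )
    np.save( "neighbors.npy", np.array( neighbors, dtype = object ), allow_pickle = True )

    # EVERY RE-RUN: only the cheap, per-i averaging over the pre-selected j-s remains:
    neighbors = np.load( "neighbors.npy", allow_pickle = True )
    for i, js in enumerate( neighbors ):
        js = np.asarray( js, dtype = np.int64 )
        js = js[ meo[js, TIME_MEO] == aq[i, TIME_AQ] ]             # the time-match filter
        if js.size:
            final_tmp[i, HUMIDITY_FINAL] = meo[js, HUMIDITY_MEO].mean()
            # ... and the same for pressure / temperature / wind_speed / wind_direction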
BONUS:
If interested in deeper reasoning about the achievable performance gains from going into N-CPU-operated computing graphs, feel free to learn more about the re-formulated, overhead-strict Amdahl's Law and the related issues, as posted in further detail here.
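In a simplified, hedged notation, the classical formulation and its overhead-strict re-formulation ( the latter being the practically relevant one, as the add-on setup / termination overheads never disappear ) read :

    S_Amdahl          = 1 / ( ( 1 - p )           + p / N )
    S_overhead_strict = 1 / ( ( 1 - p ) + oS + oT + p / N )

    # where p      ... fraction of the process that can run in [PARALLEL]
    #       N      ... number of processing units
    #       oS, oT ... add-on setup / termination overheads, which never disappear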