Sorry, this may be a hard read, but reality is cruel: many enthusiasts can easily sink man*months of "coding" efforts into a principally a-priori lost war. Better to carefully re-assess all the a-priori known PROS / CONS of any re-engineering plans, before blindly spending a single man*day in a principally wrong direction.
Most probably I would not post this here, had I not been exposed to a Project, where top-level academicians spent dozens of man*years ( yes, more than a year with a team of 12+ ) on "producing" a processing that took ~ 26 [hr], which was reproducible in less than ~ 15 [min] ( and way cheaper in HPC/GPU infrastructure costs ), if designed using proper ( hardware-performance non-devastating ) design methods...
"It doesn't seem like I can proceed any differently."
Well, that is actually pretty hard to tell, if not impossible :
Your post seems to assume a few cardinal things that may pretty much prevent getting any real benefit from moving the above-sketched idea onto an indeed professional HPC / GPU infrastructure.
"Possible solution: parallel programming"
Way easier to say it / type it, than to actually do it.
A-wish-to-run-in-true-[PARALLEL] process-scheduling remains just a wish and ( believe me, or Gene Amdahl, or other C/S veterans, or not ) an indeed hard process re-design is required, if your code is to get remarkably better than in the pure-[SERIAL] code-execution flow ( as posted above ).
1 ) the pure-[SERIAL] nature of the fileIO can ( almost ) kill the game :

the not-posted part about the pure-[SERIAL] file-accesses ( two files with data-points )... any fileIO is by nature the most expensive resource and, short of some smart re-engineering, it remains a pure-[SERIAL] (re-)reading in a sequential manner ( at best a one-stop cost, but still serial ), so do not expect any Giant-Leap anywhere far from this in the re-engineered code. This will always be the slowest and always an expensive phase.
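For the sake of illustration, a minimal one-stop-cost sketch of such a fileIO phase, assuming the two inputs are plain CSV files ( the file names, the skiprows and the column indices are illustrative assumptions only, adjust these to your real data layout ) :

    import numpy as np

    LAT_AQ,  LON_AQ  = 1, 2        # ASSUMPTION: adjust to your real column layout
    LAT_MEO, LON_MEO = 1, 2        # ASSUMPTION: adjust to your real column layout

    # Pay the ( expensive ) fileIO price exactly ONCE, land both datasets
    # in contiguous, RAM-resident numpy arrays and never re-touch the
    # filesystem again during the computing phase:
    aq  = np.loadtxt( "aq_data.csv",  delimiter = ",", skiprows = 1 )  # one pass, pure-[SERIAL]
    meo = np.loadtxt( "meo_data.csv", delimiter = ",", skiprows = 1 )  # one pass, pure-[SERIAL]

    latitude_aq,  longitude_aq  = aq[:,  LAT_AQ],  aq[:,  LON_AQ]      # zero-copy column views
    latitude_meo, longitude_meo = meo[:, LAT_MEO], meo[:, LON_MEO]     # zero-copy column views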
BONUS:
While this may seem the least sexy item in the inventory list of parallel-computing, pycuda, distributed-computing, hpc, parallelism-amdahl or whatever slang comes next, the rudimentary truth is that for making HPC-computing indeed fast and resource-efficient, both the inputs ( yes, the static files ) and the computing strategy are typically optimised for stream-processing and for enjoying a non-broken ( collision-avoided ) data-locality, if peak performance is to be achieved. Any inefficiency in these two domains can not just add to, but actually FACTOR the computing expenses ( so DIVIDE the performance ), and the differences may easily grow into several orders of magnitude ( from [ns] -> [us] -> [ms] -> [s] -> [min] -> [hr] -> [day] -> [week], you name them all... )
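As a hedged illustration of that input-side optimisation: convert the ( slow, text-parsed ) CSV just once into a binary .npy copy, pre-sorted along the join-key, so that every later run streams it at raw disk bandwidth, with a non-broken data-locality ( the file name and the TIME_MEO column index are assumptions carried over from the code below ) :

    import numpy as np

    # ONE-OFF pre-processing ( pay the text-parsing price exactly once ):
    meo = np.loadtxt( "meo_data.csv", delimiter = ",", skiprows = 1 )
    meo = meo[ np.argsort( meo[:, TIME_MEO], kind = "stable" ) ]  # sorted by the join-key
    np.save( "meo_data.npy", meo )                                # binary, locality-friendly copy

    # EVERY LATER RUN ( no parsing at all, lazy page-by-page streaming ):
    meo = np.load( "meo_data.npy", mmap_mode = "r" )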
2 ) the cost / benefit balance may get you to PAY-WAY-MORE-THAN-YOU-GET

This part is indeed your worst enemy : if the lump sum of your efforts is higher than the sum of the net benefits, the GPU will not add any value at all, or not enough to cover your add-on costs.
Why?
GPU engines are SIMD devices, great at using latency-masking over a vast area running repetitively the very same block of SMX-instructions, which needs a certain "weight"-of-"nice"-mathematics to happen locally, if they are to show any processing speedup over other problem-implementation strategies. GPU devices ( not the gamers' ones, but the HPC ones, which not all cards in the class are, are they? ) deliver best for indeed small areas of data-locality ( micro-kernel matrix operations, having a very dense, best a very small SMX-local "RAM" footprint of such a dense kernel << ~ 100 [kB] as of 2018/Q2 ).
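To see why a branch hurts a SIMD device: under divergence the hardware in effect executes both sides of the if for the whole SIMD-width and masks out the losers, so both paths get paid for. A branch-free formulation ( a generic, hedged illustration, not your code ) makes that cost-model explicit :

    import numpy as np

    x = np.random.rand( 1_000_000 )

    # branchy, scalar thinking ( a SIMD-killer, if transplanted as-is ):
    #     y[i] = x[i] * 2.0 if x[i] > 0.5 else 0.0
    #
    # branch-free, SIMD-friendly thinking: compute for ALL lanes,
    # then select by a mask ( both paths are paid for anyway ):
    mask = x > 0.5
    y    = np.where( mask, x * 2.0, 0.0 )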
Your "computing"-part of the code has ZERO-RE-USE of any single data-element that was ( rather expensively ) fetched from an original static storage, so almost all the benefits, that the GPU / SMX / SIMD artillery has been invented for is NOT USED AT ALL and you receive a NEGATIVE net-benefit from trying to load that sort of code onto such a heterogeneous ( NUMA complicated ) distributed computing ( yes, each GPU-device is a rather "far", "expensive" ( unless your code will harness it's SMX-resources up until almost smoke comes out of the GPU-silicon ... ) and "remote" asynchronously operated distributed-computing node, inside your global computing strategy ) system.
The very first branching of the GPU code is devastatingly expensive in its SIMD-execution costs, so your heavily if-ed code, while syntactically fair, is performance-wise an almost-killer of the game:
    for i in range( 0,
                    len( longitude_aq )
                    ):                                    #______________________________________ITERATOR #1 ( SEQ-of-I-s )
        currentAq = aq[i, :]                              # .SET
        center    = Coordinates( latitude_aq[i],          # .SET
                                 longitude_aq[i]
                                 )                        #           |
                                                          #           +-------------> # EASY2VECTORISE in [i]
        for j in range( 0,
                        len( longitude_meo )
                        ):                                #- - - - - - - - - - - - - - ITERATOR #2 ( SEQ-of-J-s )
            currentMeo = meo[j, :]                        # .SET
            grid_point = Coordinates( latitude_meo[j],    # .SET
                                      longitude_meo[j]
                                      )                   #           |
                                                          #           +-------> # EASY2VECTORISE in [j]
            if is_in_circle( center,
                             RADIUS,
                             grid_point
                             ):                           # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #1
                if ( currentAq[TIME_AQ]
                  == currentMeo[TIME_MEO]
                     ):                                   # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #2
                    humidity       += currentMeo[      HUMIDITY_MEO]  # BEST PERF.
                    pressure       += currentMeo[      PRESSURE_MEO]  #      IF SMART
                    temperature    += currentMeo[   TEMPERATURE_MEO]  #      CURATED
                    wind_speed     += currentMeo[    WIND_SPEED_MEO]  #      AS NON
                    wind_direction += currentMeo[WIND_DIRECTION_MEO]  #      ATOMICS
                    count          += 1.0
        if count != 0.0:                                  # !!!!!!!!!!!!!!!!!! THIS NEVER HAPPENS
            # EXCEPT WHEN ZERO DATA-POINTS WERE AVAILABLE FOR THE i-TH ZONE,
            #        FILE DID NOT CONTAIN ANY SUCH,
            #        WHICH IS FAIR,
            #        BUT SUCH A BLOCK OUGHT NEVER HAVE STARTED ANY COMPUTING AT ALL
            #        IF ASPIRING FOR INDEED BEING LOADED
            #        ONTO AN HPC-GRADE COMPUTING INFRASTRUCTURE ( SPONSORED OR NOT )
            #
            final_tmp[i, HUMIDITY_FINAL]       = humidity       / count
            final_tmp[i, PRESSURE_FINAL]       = pressure       / count
            final_tmp[i, TEMPERATURE_FINAL]    = temperature    / count
            final_tmp[i, WIND_SPEED_FINAL]     = wind_speed     / count
            final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction / count
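For the sake of completeness, a hedged numpy-vectorised sketch of the very same computing, with both IF-ed SIMD-KILLERs turned into boolean masks. It assumes is_in_circle() tests a great-circle distance with RADIUS in [km] ( approximated below by a haversine helper, swap in your own predicate, if it differs ) and re-uses your column-index constants :

    import numpy as np

    def haversine_km( lat1, lon1, lat2, lon2 ):    # ASSUMPTION: stands in for is_in_circle()
        """Great-circle distance [km]; broadcasts over numpy arrays."""
        lat1, lon1, lat2, lon2 = map( np.radians, ( lat1, lon1, lat2, lon2 ) )
        a = ( np.sin( ( lat2 - lat1 ) / 2 )**2
            + np.cos( lat1 ) * np.cos( lat2 ) * np.sin( ( lon2 - lon1 ) / 2 )**2
              )
        return 2 * 6371.0 * np.arcsin( np.sqrt( a ) )

    # all-[i,j] masks at once, no if-s, no python-level loops
    # ( NOTE: this costs O( I*J ) RAM; chunk over i, if I*J grows too large ):
    in_circle = haversine_km( latitude_aq[:, None],  longitude_aq[:, None],
                              latitude_meo[None, :], longitude_meo[None, :]
                              ) <= RADIUS                           # ~ IF-ed SIMD-KILLER #1
    same_time = ( aq[:, TIME_AQ][:, None]
               == meo[:, TIME_MEO][None, :]
                  )                                                 # ~ IF-ed SIMD-KILLER #2
    hitf  = ( in_circle & same_time ).astype( np.float64 )          # [i,j] 0./1. hit-matrix
    count = hitf.sum( axis = 1 )                                    # per-i data-point counts
    valid = count > 0                                               # guard the divisions

    for COL_MEO, COL_FINAL in ( (       HUMIDITY_MEO,       HUMIDITY_FINAL ),
                                (       PRESSURE_MEO,       PRESSURE_FINAL ),
                                (    TEMPERATURE_MEO,    TEMPERATURE_FINAL ),
                                (     WIND_SPEED_MEO,     WIND_SPEED_FINAL ),
                                ( WIND_DIRECTION_MEO, WIND_DIRECTION_FINAL ),
                                ):
        sums = hitf @ meo[:, COL_MEO]                               # masked per-i sums
        final_tmp[valid, COL_FINAL] = sums[valid] / count[valid]    # means, only where data exist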
If we omit both the iterators over the domain of all [i,j]-s and the if-ed crossroads, the actual "useful" part of the computing performs a very shallow amount of mathematics: the job contains a few SLOCs, in which principally independent values are summed ( best having avoided any collision of the adding operations, so each can be operated very cheaply, independently of the others, best with well-ahead pre-fetched constants ) in less than a few [ns].

YES, your computing payload does not require anything more than just a few units of [ns] to execute.
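If in doubt, a quantitative test will tell the facts ( a minimal sketch, using just the standard timeit plus numpy; the array shape is an illustrative assumption and the numbers will differ per platform ) :

    import numpy as np, timeit

    N = 10_000_000
    v = np.random.rand( N, 5 )                      # five "meo"-like columns

    # the naked payload: five streaming ADD-s per data-point,
    # operands already contiguous and CPU-close:
    t = timeit.timeit( lambda: v.sum( axis = 0 ), number = 10 ) / 10
    print( t / N * 1e9, "[ns] per data-point, all five ADD-s included" )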
The problem is in the smart-engineering of the data-flow ( I like to call that DATA-HYDRAULICS: how to get a further-incompressible flow of DATA into the { CPU | GPU | APU | *** }-processor registers, so as to get 'em processed ).
All the rest is easy. A smart solution of the HPC-grade DATA-HYDRAULICS typically is not.
No language, no framework will help you in this automatically. Some can relieve you of a part of the solution-engineering "manual" efforts, some cannot, and some can even spoil the achievable computing performance, due to "cheap" shortcuts in their internal design decisions and compromises made, which do not serve the same target you have: The Performance.
The Best next step?
A ) Try to better understand the limits of the computing infrastructures you expect to use for your extensive ( but not intensive, yes, just a few SLOCs per [i,j] ) workload, which HPC-supervisors do not like to see flowing onto their operated, expensive HPC-resources.
B ) If in trouble with the time + headcount + financial resources needed to re-engineer a top-down DATA-HYDRAULICS solution, best re-factor your code so as to get at least into a vectorised, numpy / numba style of processing ( numba will not always get remarkably farther than an already smart numpy-vectorised code, but a quantitative test will tell the facts per incident, not in general ). A vectorised sketch of your double loop was already shown above, right after your as-is code.
C ) If your computing-problem is expected to get re-run more often, definitely assess a re-designed pipeline, starting from the early pre-processing of the data-storage ( the slowest part of the processing ), where a stream-based pre-processing of the principally static values is possible and could impact the resulting DATA-HYDRAULICS flow ( performance ) the most, with pre-computed + smart-aligned values. The block of a few ADD-s down the lane will not get improved beyond a few [ns], as reported above, but the slow-flow can jump orders of magnitude faster, if re-arranged into a smarter flow, harnessing all available, yet "just"-[CONCURRENT]-ly operated resources ( any attempt to arrange a True-[PARALLEL] scheduling is pure nonsense here, as the task is principally by no means a [PARALLEL]-scheduling problem, but a stream of pure-[SERIAL] (re-)processing of data-points, where a smart, yet "just"-[CONCURRENT] processing re-arrangement may help scale-down the resulting duration of the process ). A sketch of such a pre-computed pipeline follows right below.
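A hedged sketch of such a C )-style pipeline: a one-off pre-processing pass pre-computes the static geometry ( which meo grid-points fall within RADIUS of which aq center ) with scipy's cKDTree and stores it, so every re-run skips the O( I*J ) distance work altogether ( the file name is an illustrative assumption, RADIUS is assumed to be in [km] ) :

    import numpy as np
    from scipy.spatial import cKDTree

    def to_xyz( lat_deg, lon_deg ):
        """Map lat/lon [deg] onto the unit sphere, so Euclidean ~ great-circle."""
        lat, lon = np.radians( lat_deg ), np.radians( lon_deg )
        return np.column_stack( ( np.cos( lat ) * np.cos( lon ),
                                  np.cos( lat ) * np.sin( lon ),
                                  np.sin( lat ) ) )

    EARTH_R_KM = 6371.0
    chord      = 2.0 * np.sin( RADIUS / ( 2.0 * EARTH_R_KM ) )     # great-circle radius -> chord

    # ONE-OFF pre-processing ( the static geometry never changes between re-runs ):
    tree      = cKDTree( to_xyz( latitude_meo, longitude_meo ) )
    neighbors = tree.query_ball_point( to_xyz( latitude_aq, longitude_aq ), r = chord )
    np.save( "neighbors.npy", np.array( neighbors, dtype = object ), allow_pickle = True )

    # EVERY RE-RUN: only the cheap, per-i averaging over the pre-selected j-s remains:
    neighbors = np.load( "neighbors.npy", allow_pickle = True )
    for i, js in enumerate( neighbors ):
        js = np.asarray( js, dtype = np.int64 )
        js = js[ meo[js, TIME_MEO] == aq[i, TIME_AQ] ]             # the time-match filter
        if js.size:
            final_tmp[i, HUMIDITY_FINAL] = meo[js, HUMIDITY_MEO].mean()
            # ... and the same for pressure / temperature / wind_speed / wind_direction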
BONUS:
If interested in deeper reasoning about the achievable performance gains from going into N-CPU-operated computing graphs, feel free to learn more about the re-formulated, overhead-strict Amdahl's Law and the related issues, as posted in further detail here.
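In a simplified, hedged notation, the classical formulation and its overhead-strict re-formulation ( the latter being the practically relevant one, as the add-on setup / termination overheads never disappear ) read :

    S_Amdahl          = 1 / ( ( 1 - p )           + p / N )
    S_overhead_strict = 1 / ( ( 1 - p ) + oS + oT + p / N )

    # where p      ... fraction of the process that can run in [PARALLEL]
    #       N      ... number of processing units
    #       oS, oT ... add-on setup / termination overheads, which never disappear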