
I am dealing with quite big meteorological GRIB files, each containing more than 50000 messages. They hold information for many parameters (temperature, geopotential, vorticity, etc.), and I need to access it. I use pygrib to read them.

What I do is open the file and then read each parameter using pygrib's "select" function. The problem is that the select function is very slow. I thought about parallelizing the reading by splitting the file into chunks of messages (which I do not know how to do), but I think it might be simpler to read each parameter in parallel (i.e. send the select call for each parameter to one CPU and write the output into an array).

My code is this:

import pygrib as pg

grb = pg.open('file_name.grib')
temperature = grb.select(name='Temperature')
geop_height = grb.select(name='Geopotential')

I would like to send each grb.select call to a separate CPU in order to speed up the process. Is that possible? I read about the multiprocessing package, but I do not know how to use it here (I saw a few examples, like answer 3 in Read large file in parallel?, but I do not know how to adapt them to my case).

I thought of something like:

import multiprocessing as mp

def readparam(grb_file, param):
    return grb_file.select(name=param)

params = ['Temperature', 'Geopotential']
pool = mp.Pool(processes=len(params))

and then use a call like

pool.map(readparam, params)

to get the results.
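To make the idea more concrete, here is a fuller sketch of what I have in mind. I am assuming each worker has to open the file by name itself (I suspect the open pygrib object cannot be shared between processes) and that it is safer to return plain numpy arrays rather than the message objects, since I am not sure those can be pickled; 'file_name.grib' is just a placeholder for my actual file:

import multiprocessing as mp
import pygrib as pg

FILE_NAME = 'file_name.grib'  # placeholder for my actual file

def read_param(param):
    # Each worker opens its own handle to the GRIB file,
    # since I do not think the open pygrib object can be shared between processes.
    grbs = pg.open(FILE_NAME)
    messages = grbs.select(name=param)
    # Return plain numpy arrays, because I am not sure the message
    # objects themselves can be pickled and sent back to the parent.
    values = [msg.values for msg in messages]
    grbs.close()
    return values

if __name__ == '__main__':
    params = ['Temperature', 'Geopotential']
    pool = mp.Pool(processes=len(params))
    results = pool.map(read_param, params)
    pool.close()
    pool.join()
    # results[0] -> list of Temperature fields, results[1] -> Geopotential fields

I do not know whether this is the right way to structure it, or whether opening the same file from several processes at once causes problems.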

Also, would it be possible to parallelize a single grb.select command (i.e. divide the task of selecting all the temperature messages, for instance, across many CPUs)? The closest thing I could come up with is sketched below.
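For this second idea, my guess is to skip select entirely and split the message indices into chunks, with each worker opening the file, stepping through its own range of messages and keeping only the ones matching the parameter. This is only a guess (I do not know whether it would actually be faster than a single select), and again 'file_name.grib', the worker count and the parameter name are placeholders:

import multiprocessing as mp
import pygrib as pg

FILE_NAME = 'file_name.grib'  # placeholder
N_WORKERS = 64                # the machine I have access to
PARAM = 'Temperature'

def read_chunk(bounds):
    # Each worker opens its own handle and walks its own slice of messages.
    start, stop = bounds
    grbs = pg.open(FILE_NAME)
    values = []
    for i in range(start, stop):
        msg = grbs.message(i)          # pygrib numbers messages from 1
        if msg.name == PARAM:
            values.append(msg.values)  # keep arrays, not message objects
    grbs.close()
    return values

if __name__ == '__main__':
    grbs = pg.open(FILE_NAME)
    n_msgs = grbs.messages             # total number of messages in the file
    grbs.close()
    # Split 1..n_msgs into one index range per worker.
    step = n_msgs // N_WORKERS + 1
    bounds = [(i, min(i + step, n_msgs + 1)) for i in range(1, n_msgs + 1, step)]
    pool = mp.Pool(processes=N_WORKERS)
    chunks = pool.map(read_chunk, bounds)
    pool.close()
    pool.join()
    temperature_fields = [v for chunk in chunks for v in chunk]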

I have access to a 64 CPU machine and this would help a lot.

Thank you in advance!
