I am dealing with quite large meteorological GRIB files, each containing more than 50,000 messages. They hold information on many parameters (temperature, geopotential, vorticity, etc.), and I need to access it. I use pygrib to read them.
What I do is open the file and then read each parameter using pygrib's select function. The problem is that the select function is very slow. I thought about parallelizing the read by splitting the file into chunks of messages (which I do not know how to do), but I think it might be simpler to read each parameter in parallel (i.e. send the select call for each parameter to a CPU and write the output into an array).
My code is this:
import pygrib as pg
grb = pg.open('file_name.grib')
temperature = grb.select(name='Temperature')
geop_height = grb.select(name='Geopotential')
I would like to send each grb.select call to its own CPU in order to speed up the process. Is it possible? I read about the multiprocessing package, but I do not know how to use it here (I saw a few examples, like answer 3 in this question: Read large file in parallel?, but I do not know how to extrapolate them to my case).
I thought of something like:
import multiprocessing as mp

def readparam(grb_file, param):
    return grb_file.select(name=param)

params = ['Temperature', 'Geopotential']
pool = mp.Pool(processes=len(params))

and then use some loop with

pool.map(readparam, params)

to get the results.
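To make that concrete, this is roughly what I imagine the per-parameter version could look like. It is only a sketch: it assumes each worker can simply reopen the GRIB file on its own (I do not think the pygrib file object can be shared between processes) and that the data has to be returned as plain numpy arrays, since the message objects themselves may not pickle; 'file_name.grib' is a placeholder for the real file name.

import multiprocessing as mp
import pygrib as pg

GRIB_FILE = 'file_name.grib'  # placeholder file name

def read_param(param):
    # Each worker opens its own handle to the file instead of sharing one.
    grbs = pg.open(GRIB_FILE)
    try:
        # Return plain numpy arrays so the results can be sent back
        # to the parent process.
        return [msg.values for msg in grbs.select(name=param)]
    finally:
        grbs.close()

if __name__ == '__main__':
    params = ['Temperature', 'Geopotential']
    pool = mp.Pool(processes=len(params))
    results = pool.map(read_param, params)
    pool.close()
    pool.join()
    temperature, geop_height = results

This way each select call would run in a separate process, one per parameter.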
Also, would it be possible to parallelize a single grb.select call (i.e. divide the task of selecting all the temperature messages, for instance, across many CPUs)?
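What I have in mind for that is something like the sketch below: get the total number of messages, split the message range into equal chunks, and have each worker open the file, seek to its chunk, read it, and keep only the messages whose name matches. This assumes pygrib's seek/read can jump to a given message number the way I understand from the docs; N_WORKERS and the parameter name are placeholders, and I do not know whether the decoding itself actually gets faster this way.

import multiprocessing as mp
import numpy as np
import pygrib as pg

GRIB_FILE = 'file_name.grib'  # placeholder file name
N_WORKERS = 8                 # placeholder; could go up to 64 here

def read_chunk(args):
    start, count, param = args
    # Each worker scans its own contiguous slice of the file.
    grbs = pg.open(GRIB_FILE)
    try:
        grbs.seek(start)            # position at the first message of this chunk
        messages = grbs.read(count) # read `count` messages from there
        return [msg.values for msg in messages if msg.name == param]
    finally:
        grbs.close()

if __name__ == '__main__':
    grbs = pg.open(GRIB_FILE)
    n_msgs = grbs.messages          # total number of messages in the file
    grbs.close()

    # Split the message range into roughly equal, contiguous chunks.
    bounds = np.linspace(0, n_msgs, N_WORKERS + 1).astype(int)
    tasks = [(int(lo), int(hi - lo), 'Temperature')
             for lo, hi in zip(bounds[:-1], bounds[1:])]

    pool = mp.Pool(processes=N_WORKERS)
    chunks = pool.map(read_chunk, tasks)
    pool.close()
    pool.join()

    # Flatten the per-chunk results back into a single list of arrays.
    temperature = [arr for chunk in chunks for arr in chunk]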
I have access to a 64-CPU machine and this would help a lot.
Thank you in advance!