
Background

For a specific place of interest, we often want to extract useful information from an open data source. Take meteorological data, for example: we may only want the long-term temporal pattern at a single point, but the data files often cover the whole world.

Here, I use Python to extract the vertical velocity at one spot from FNL (ds083.2) files, which are available every 6 hours over one year.

In other words, I want to read the original data and save the target variable along the timeline.

My attempt

import os
import numpy as np
import pygrib
import pandas as pd

# Find the value in `array` that is nearest to `value`
def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return array[idx]

## Obtain the X, Y indices of the nearest grid box
site_x, site_y = 116.4074, 39.9042   ## The point of interest (lon, lat)
grib = './fnl_20140101_06_00.grib2'  ## Any one file, used only to obtain the lat-lon grid
grbs = pygrib.open(grib)
grb = grbs.select(name='Vertical velocity')[8]
lon_list, lat_list = grb.latlons()[1][0], grb.latlons()[0].T[0]
x_indice = np.where(lon_list == find_nearest(lon_list, site_x))[0]
y_indice = np.where(lat_list == find_nearest(lat_list, site_y))[0]

def extract_vm():
    files = os.listdir('.')  ### All files have already been saved in one directory
    files.sort()
    dict_vm = {"V": []}
    ### Traverse the files in time order
    for file in files:
        if file.endswith(".grib2"):
            grbs = pygrib.open(file)
            grb = grbs.select(name='Vertical velocity')[4]  ## Select a certain Z level
            data = grb.values[y_indice, x_indice]
            dict_vm['V'].append(data)
            grbs.close()

    ff = pd.DataFrame(dict_vm)
    return ff

extract_vm()

My thought

How can I speed up the reading process? Right now I read the files sequentially, so the run time grows linearly with the length of the period being processed.
Can we split the files into several groups and handle them separately on a multi-core processor? Is there any other advice on my code for improving the speed?
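A rough sketch of what I have in mind with multiprocessing.Pool is below (untested; the helper names extract_one and extract_vm_parallel and the worker count are placeholders, and the level index [4] and the y_indice / x_indice lookup are the same as in the code above). Since each file can be processed independently, every worker opens its own file, so only file names and integer indices cross process boundaries:

import os
import pygrib
import pandas as pd
from multiprocessing import Pool

def extract_one(args):
    # Worker: open one GRIB2 file, pull the single grid-point value, and close it
    fname, y_idx, x_idx = args
    grbs = pygrib.open(fname)
    grb = grbs.select(name='Vertical velocity')[4]   # same Z level index as above
    value = float(grb.values[y_idx, x_idx])
    grbs.close()
    return fname, value

def extract_vm_parallel(y_idx, x_idx, n_workers=4):
    files = sorted(f for f in os.listdir('.') if f.endswith('.grib2'))
    with Pool(n_workers) as pool:
        # pool.map keeps the input (time-sorted) order in its results
        results = pool.map(extract_one, [(f, y_idx, x_idx) for f in files])
    return pd.DataFrame(results, columns=['file', 'V'])

if __name__ == '__main__':
    # y_indice / x_indice come from the nearest-grid-box lookup above
    ff = extract_vm_parallel(int(y_indice[0]), int(x_indice[0]))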

Any comments will be appreciated!

Han Zhengzu
  • Maybe these will be of help: http://stackoverflow.com/questions/18104481/read-large-file-in-parallel and http://stackoverflow.com/questions/4047789/parallel-file-parsing-multiple-cpu-cores – Khris Oct 14 '16 at 10:13
  • This type of pandas parallelization problem is addressed today by using [dask](http://dask.pydata.org). dask lets you keep working with pandas dataframes the way you do today, and you can parallelize across multiple CPUs, out of core and out of memory on one machine, up to a cluster of machines – Zeugma Oct 14 '16 at 18:41
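
Following the dask suggestion in the comment above, a minimal sketch with dask.delayed might look like this (untested; scheduler='processes' assumes a reasonably recent dask release, and the level index and the y_indice / x_indice variables mirror the question's code):

import os
import pygrib
import pandas as pd
import dask
from dask import delayed

@delayed
def point_value(fname, y_idx, x_idx):
    # One lazy task per file: open it, read the single grid-point value, close it
    grbs = pygrib.open(fname)
    value = float(grbs.select(name='Vertical velocity')[4].values[y_idx, x_idx])
    grbs.close()
    return value

files = sorted(f for f in os.listdir('.') if f.endswith('.grib2'))
tasks = [point_value(f, int(y_indice[0]), int(x_indice[0])) for f in files]
values = dask.compute(*tasks, scheduler='processes')  # run the per-file tasks on multiple cores
ff = pd.DataFrame({'V': list(values)})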
