Background
For a place of interest, we want to extract useful information from an open data source. Take meteorological data for example: we only want to recognize the long-term temporal pattern at a single point, but the data files often cover the whole world.
Here, I use Python to extract the vertical velocity at one spot from the FNL (ds083.2) files, which are available every 6 hours, over one year.
In other words, I want to read the original data and save the target variable along the timeline.
My attempt
import numpy as np
from netCDF4 import Dataset
import pygrib
import pandas as pd
import os, time, datetime
# Find the corresponding grid box
def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return array[idx]
## Obtain the X, Y indices of the grid box containing the site
site_x, site_y = 116.4074, 39.9042  ## The location of interest (lon, lat)
grib = './fnl_20140101_06_00.grib2'  ## Any one file will do for obtaining the lat-lon grid
grbs = pygrib.open(grib)
grb = grbs.select(name='Vertical velocity')[8]
lon_list, lat_list = grb.latlons()[1][0], grb.latlons()[0].T[0]
x_indice = np.where(lon_list == find_nearest(lon_list, site_x))[0]
y_indice = np.where(lat_list == find_nearest(lat_list, site_y))[0]
def extract_vm():
    files = os.listdir('.')  ### All files have already been saved in one directory
    files.sort()
    dict_vm = {"V": []}
    ### Traverse the files
    for file in files[1:]:
        if file.endswith("grib2"):
            grbs = pygrib.open(file)
            grb = grbs.select(name='Vertical velocity')[4]  ## Select a certain Z level
            data = grb.values
            data = data[y_indice, x_indice]
            dict_vm['V'].append(data)
            grbs.close()  ## Release the file handle after reading
    ff = pd.DataFrame(dict_vm)
    return ff

extract_vm()
My thoughts
How can I speed up the reading process? Right now I read the files sequentially, so the run time grows linearly with the length of the time period being processed.
Can we split the files into several chunks and process them separately on a multi-core processor? Is there any other advice on my code to improve the speed?
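To make the multi-core idea concrete, here is a minimal sketch of what I have in mind, using Python's multiprocessing.Pool to hand each worker its own files. The names process_file and extract_vm_parallel are hypothetical, the pool size of 4 is just an example, and y_indice / x_indice are assumed to be the module-level indices computed above (on Windows the call would also need to sit under an if __name__ == '__main__': guard).

from multiprocessing import Pool

def process_file(file):
    ## Read one grib2 file and return the vertical velocity at the target grid box
    grbs = pygrib.open(file)
    grb = grbs.select(name='Vertical velocity')[4]  ## Same Z level as in extract_vm
    value = grb.values[y_indice, x_indice]
    grbs.close()
    return value

def extract_vm_parallel(n_workers=4):  ## Hypothetical helper; worker count is an example
    grib_files = sorted(f for f in os.listdir('.') if f.endswith("grib2"))
    with Pool(n_workers) as pool:
        ## Each worker opens and decodes its own files independently
        values = pool.map(process_file, grib_files)
    return pd.DataFrame({"V": values})

Whether this actually helps should depend on whether the bottleneck is decoding the GRIB messages (CPU-bound, which parallelizes well across cores) or reading from disk; timing a single file first would show which case applies.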
Any comments will be appreciated!