I need to access a bunch of historical weather data and use ERA5 dataset (> 1 Mio. specific locations points at specific timestamps).
I access it through a downloaded GRIB file and the xarray package. Unfortunately, the script runs quite slow: every single point needs about 250 to 300 ms ms of processing. In large parts, this is due to accessing and computing the daily values (130 to 200 ms).
Do you know why it takes so long? Can I optimize it somehow (besides parallelization)?
I already tried creating a deep copy of weatherdata_oneday
, this didn't change anything (it was neither faster nor slower).
My code is:
import xarray as xr
import pandas as pd
import zoneinfo
import datetime
KELVIN_TO_DEGREE_CELSIUS = -273.15
ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib',
backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})
coords_times = pd.read_csv("coords_times.csv")
all_results = {}
def to_naive(timestamp_with_timezone):
timestamp_in_utc = timestamp_with_timezone.astimezone(datetime.timezone.utc)
naive_timestamp = timestamp_in_utc.replace(tzinfo = None)
return(naive_timestamp)
for index, row in coords_times.iterrows():
loc = ds.sel(longitude=[row["lon"]], latitude=[row["lat"]], method="nearest")
# some timestamp handling - just included for the sake of completeness
timestamp_exact = datetime.datetime.fromisoformat(row["timestamp"])
timestamp_start_hour = datetime.datetime(timestamp_exact.year, timestamp_exact.month, timestamp_exact.day, timestamp_exact.hour, 0, 0, tzinfo=timestamp_exact.tzinfo)
timestamp_current_day_begin = datetime.datetime(timestamp_exact.year, timestamp_exact.month, timestamp_exact.day, 0, 0, 0, tzinfo=timestamp_exact.tzinfo)
timestamp_current_day_end = timestamp_current_day_begin + datetime.timedelta(hours = 23)
weatherdata_oneday = loc.sel(time = slice(to_naive(timestamp_current_day_begin), to_naive(timestamp_current_day_end)))
# critical code part
# this takes between 130 and 200 ms
this_day_results = {
"day_temperature_min" : weatherdata_oneday.t2m.min().values + KELVIN_TO_DEGREE_CELSIUS,
"day_temperature_max" : weatherdata_oneday.t2m.max().values + KELVIN_TO_DEGREE_CELSIUS,
"day_temperature_mean" : weatherdata_oneday.t2m.mean().values + KELVIN_TO_DEGREE_CELSIUS,
"day_cloudcover_mean" : weatherdata_oneday.tcc.mean().values,
"day_cloudcover_hours_low" : (weatherdata_oneday.tcc.values < 0.2).sum()
}
this_results = {**this_day_results}
all_results[row["id"]] = this_results
Thank you very much!