Why is accessing values in xarray so slow?

Question

I need to access a bunch of historical weather data and use ERA5 dataset (> 1 Mio. specific locations points at specific timestamps).

I access it through a downloaded GRIB file and the xarray package. Unfortunately, the script runs quite slow: every single point needs about 250 to 300 ms ms of processing. In large parts, this is due to accessing and computing the daily values (130 to 200 ms).

Do you know why it takes so long? Can I optimize it somehow (besides parallelization)?

I already tried creating a deep copy of weatherdata_oneday, this didn't change anything (it was neither faster nor slower).

My code is:

import xarray as xr
import pandas as pd
import zoneinfo
import datetime

KELVIN_TO_DEGREE_CELSIUS = -273.15

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib', 
                     backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})

coords_times = pd.read_csv("coords_times.csv")


all_results = {}


def to_naive(timestamp_with_timezone):
    timestamp_in_utc = timestamp_with_timezone.astimezone(datetime.timezone.utc)
    naive_timestamp = timestamp_in_utc.replace(tzinfo = None)
    return(naive_timestamp)


for index, row in coords_times.iterrows():
    loc = ds.sel(longitude=[row["lon"]], latitude=[row["lat"]], method="nearest")

    # some timestamp handling - just included for the sake of completeness
    timestamp_exact = datetime.datetime.fromisoformat(row["timestamp"])
    timestamp_start_hour = datetime.datetime(timestamp_exact.year, timestamp_exact.month, timestamp_exact.day, timestamp_exact.hour, 0, 0, tzinfo=timestamp_exact.tzinfo)
    timestamp_current_day_begin = datetime.datetime(timestamp_exact.year, timestamp_exact.month, timestamp_exact.day, 0, 0, 0, tzinfo=timestamp_exact.tzinfo)
    timestamp_current_day_end = timestamp_current_day_begin + datetime.timedelta(hours = 23)

    weatherdata_oneday = loc.sel(time = slice(to_naive(timestamp_current_day_begin), to_naive(timestamp_current_day_end)))

    # critical code part
    # this takes between 130 and 200 ms
    this_day_results = {
            "day_temperature_min" : weatherdata_oneday.t2m.min().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_max" : weatherdata_oneday.t2m.max().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_temperature_mean" : weatherdata_oneday.t2m.mean().values + KELVIN_TO_DEGREE_CELSIUS,
            "day_cloudcover_mean" : weatherdata_oneday.tcc.mean().values,
            "day_cloudcover_hours_low" : (weatherdata_oneday.tcc.values < 0.2).sum()
    }

    this_results = {**this_day_results}
    all_results[row["id"]] = this_results

Thank you very much!

gabolo · Answer 1 · 2023-07-18T08:56:09.803

1

Okay, I found the reason. Accessing values can be quite slow in xarray, as the data is "lazy loaded".

Adding the following before any access is performed, reduces the time required to create the resulting dict to practically 0.

ds.load()

So the resulting code is:


[...]

ds = xr.open_dataset("era5_hourly_single_levels_extract.grib", engine='cfgrib', 
                         backend_kwargs={'filter_by_keys': {'typeOfLevel':'surface', 'edition': 1}})

ds.load()

[...]

These resources helped me:

edited Jul 18 '23 at 08:56

answered Jul 14 '23 at 18:25

gabolo

21
4

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 17 '23 at 20:26
I added a larger code chunk. I hope it's clearer now. – gabolo Jul 18 '23 at 08:56

Why is accessing values in xarray so slow?

1 Answers1