
I'm fairly new to xarray and I'm currently trying to use it to subset some NetCDF files. I'm running this on a shared server and would like to know how best to limit the processing power xarray uses so that it plays nicely with others. I've read through the dask and xarray documentation a bit, but it isn't clear to me how to cap the number of CPUs/threads it uses. Here's an example of a spatial subset:

import glob
import os
import xarray as xr

from multiprocessing.pool import ThreadPool
import dask

wd = os.getcwd()

test_data = os.path.join(wd, 'test_data')
lat_bnds = (43, 50)
lon_bnds = (-67, -80)
output = 'test_data_subset'

def subset_nc(ncfile, lat_bnds, lon_bnds, output):
    if not os.path.exists(output):
        os.makedirs(output)
    outfile = os.path.join(output, os.path.basename(ncfile).replace('.nc', '_subset.nc'))

    with dask.config.set(scheduler='threads', pool=ThreadPool(5)):
        ds = xr.open_dataset(ncfile, decode_times=False)

        ds_sub = ds.where(
            (ds.lon >= min(lon_bnds)) & (ds.lon <= max(lon_bnds)) & (ds.lat >= min(lat_bnds)) & (ds.lat <= max(lat_bnds)),
            drop=True)
        comp = dict(zlib=True, complevel=5)
        encoding = {var: comp for var in ds.data_vars}
        ds_sub.to_netcdf(outfile, format='NETCDF4', encoding=encoding)

list_files = glob.glob(os.path.join(test_data, '*'))
print(list_files)

for i in list_files:
    subset_nc(i, lat_bnds, lon_bnds, output)

I've tried a few variations on this by moving the ThreadPool configuration around, but I still see far too much activity in the server's top (>3000% CPU). I'm not sure where the issue lies.

Trevor J. Smith
  • Have you tried setting the number of workers and threads per worker as a dask setting? Maybe you can try something like: `with dd.LocalCluster(n_workers=1, threads_per_worker=5, memory_limit='15GiB') as cluster, dd.Client(cluster, set_as_default=True) as client:` and adapt the values for your machine? Ok, I just saw that any comments are most likely too late :D – Helmut Feb 26 '23 at 19:29
  • ">3000% cpu activity" - do you have this many CPUs, or is the reporting faulty? – mdurant Apr 12 '23 at 16:54

1 Answer


This was actually solved in https://github.com/pydata/xarray/issues/2417#issuecomment-460298993
(the GitHub issue appears to have been opened by the asker here).

The solution suggested by @jhamman (again an extremely good profile match) was to set the environment variable OMP_NUM_THREADS to the desired thread count (he suggests about 2x the wanted core count, presumably to take advantage of modern Intel/AMD simultaneous multithreading).
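A minimal sketch of that approach (the specific numbers here are illustrative, not a recommendation): set the environment variable before the numerical libraries are first imported, and optionally also cap dask's own threaded scheduler.

```python
import os

# Cap OpenMP-based threading in the compiled backends (netCDF4/HDF5,
# numpy's BLAS). This must be set before those libraries are imported,
# since they read it once at startup.
os.environ["OMP_NUM_THREADS"] = "8"  # e.g. 2x a desired 4-core budget

import dask

# Optionally also limit dask's threaded scheduler, so any dask-backed
# xarray computation uses at most this many worker threads.
dask.config.set(scheduler="threads", num_workers=4)
```

Note that `xr.open_dataset` without a `chunks=` argument loads data eagerly with NumPy rather than dask, so the dask scheduler settings only take effect for chunked (dask-backed) datasets; the OMP_NUM_THREADS cap applies either way.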

ti7