
My typical workflow is to download large netCDF datasets and then subset them to a single lat/lon gridpoint. However, I frequently need just a single gridpoint of one or more variables, such as air temperature or precipitation, and would like to efficiently subset large datasets, such as CMIP6, prior to downloading so that the download itself is small.

So far I have tried esgf-pyclient, but extracting a variable at a single gridpoint (for the years 1850 - 2100, ~91,675 days/rows of data) can take upwards of an hour. This slow speed defeats the purpose of subsetting prior to downloading. Bandwidth is not the issue, as my download speeds (ethernet) are > 1 Gbps. If anyone has any suggestions or alternative workflows, it would be appreciated!

Code I am using for esgf-pyclient:

from pyesgf.search import SearchConnection
import xarray as xr

conn = SearchConnection('https://esgf-data.dkrz.de/esg-search', distrib=True)

ctx = conn.new_context(
    product='input',
    project='ISIMIP3b',
    # model='GFDL-ESM4',
    experiment='historical',
    variable='tasAdjust',  # also: tasminAdjust, tasmaxAdjust, prAdjust
    time_frequency='day',
    data_node='esg.pik-potsdam.de'
)
print(ctx.hit_count)

result = ctx.search()[0]
print(result.dataset_id)
files = result.file_context().search()

# lazily open the first file over OPeNDAP and pick the nearest gridpoint
ds = xr.open_dataset(files[0].opendap_url).sel(
    lat=32.298583, lon=-97.78538710, method="nearest")

The desired output would be a 91,675 row, single column/vector of data for the desired gridpoint (lat/lon).
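
For reference, once `ds` holds the single-gridpoint selection from the code above, flattening it to that single-column output could look like this (a minimal sketch; note that this final step is what actually triggers the data transfer):

# 'ds' is the lazily opened, single-gridpoint dataset from the code above;
# squeeze() drops any leftover length-1 dimensions, leaving only time
series = ds['tasAdjust'].squeeze().to_series()

# one row per day, indexed by time, e.g. ~91,675 rows for 1850 - 2100
series.to_csv('tasAdjust_gridpoint.csv')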

Tyler
  • My suspicion is that your xarray call requires the entire horizontal layer to be downloaded. I don't see any other way of working out nearest neighbour. Can you modify the question to provide the opendap url so that it is easier for people to reproduce the issue? – Robert Wilson Jun 04 '21 at 14:23
  • Could you elaborate on what you mean by "subset" and "I frequently just need a single gridpoint of a particular variable(s)" or "extract a variable for a single gridpoint (from 1850 - 2100)"? It's clear that you want to grab only the information you need from the dataset, but how exactly do you need to "subset" it? For example, do you need to split the data based on time, or location? What is a "gridpoint"? What does "1850 - 2100" represent? These questions might seem trivial to you, but I'm guessing most of us aren't data scientists or meteorologists, and aren't familiar with these terms. – Paul M. Jun 04 '21 at 19:35
  • Also, what's a "variable"? Can you demonstrate your desired output on a small dataset, just as an example? Again, if you could explain what you mean by some of these terms, I think others would be more eager to help. For what it's worth, the ESGF documentation indicates that there is a [REST API](https://esgf.github.io/esgf-user-support/user_guide.html#the-esgf-search-restful-api), and it seems to accept user-defined queries. Unfortunately the documentation uses many more of these domain-specific terms I'm not familiar with, so I haven't been able to really experiment. – Paul M. Jun 04 '21 at 19:40
  • Fast extraction of time series from spatial data is a perennial problem. Frequently, the data is not chunked or optimized for time series extraction. If you have your own copy of the data, there are some things you can do. See for example [Faster reading of time series from netCDF?](https://stackoverflow.com/questions/19936432/faster-reading-of-time-series-from-netcdf). – Robert Davy Jun 08 '21 at 00:26
  • Opendap_url is just provided by ESGF-pyclient (pyesgf.search); it gives the URL of each file result stored in the variable 'files'. I edited the post to provide a bit more detail and hopefully be clearer. And thanks for pointing out that nearest neighbor may be the problem, I hadn't even considered that. I guess I assumed the latitude and longitude data used in the nearest neighbor would just create an index to subset the larger dataset. However, that may not be the case. I will explore this further (see the index-based sketch below). – Tyler Jun 09 '21 at 12:21
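
To illustrate the index-based subsetting discussed in the comments, here is a minimal sketch, assuming the first file from the question's search is used and that longitude is stored on a -180..180 grid (convert first if the file uses 0..360). Computing the nearest index locally and then selecting by position means the OPeNDAP request for the data variable should cover only the one pixel:

import numpy as np
import xarray as xr

# files[0].opendap_url comes from the search in the question
ds = xr.open_dataset(files[0].opendap_url)

# dimension coordinates (lat/lon) are loaded eagerly by xarray, so this
# nearest-neighbour arithmetic happens locally, without touching the data
ilat = int(np.abs(ds['lat'].values - 32.298583).argmin())
ilon = int(np.abs(ds['lon'].values - (-97.78538710)).argmin())

# position-based selection of the single gridpoint; the transfer itself
# only happens when the values are accessed
point = ds['tasAdjust'].isel(lat=ilat, lon=ilon)
values = point.values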

1 Answer


This seemed to work much faster:

import xarray as xr

folder = 'https://esg.pik-potsdam.de/thredds/dodsC/isimip_dataroot/isimip3b/input/clim_atm_sim/W5E5-ISIMIP3BASD2-5-0/MRI-ESM2-0/ssp370/tasAdjust/daily/v20210512/mri-esm2-0_r1i1p1f1_w5e5_ssp370_tasAdjust_global_daily_'

# select the gridpoint by coordinate value; plain .isel(lat=31, lon=-99)
# would index by array position (a negative lon counts from the end),
# not by latitude/longitude
remote_data1 = xr.open_dataset(folder + '2091_2100.nc', decode_times=False).sel(lat=31, lon=-99, method='nearest')
remote_data2 = xr.open_dataset(folder + '2081_2090.nc', decode_times=False).sel(lat=31, lon=-99, method='nearest')

# concatenate in chronological order along time
ds_all = xr.concat([remote_data2, remote_data1],
                   dim='time', join='override', data_vars='minimal',
                   coords='minimal', compat='override')
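
If more decades need to be stitched together, the same pattern generalizes. A sketch, assuming the remaining files follow the same YYYY_YYYY naming scheme (only the two spans above are confirmed to exist) and using the gridpoint coordinates from the question:

import xarray as xr

folder = ('https://esg.pik-potsdam.de/thredds/dodsC/isimip_dataroot/isimip3b/'
          'input/clim_atm_sim/W5E5-ISIMIP3BASD2-5-0/MRI-ESM2-0/ssp370/'
          'tasAdjust/daily/v20210512/'
          'mri-esm2-0_r1i1p1f1_w5e5_ssp370_tasAdjust_global_daily_')

# only these two spans are confirmed; extend the list if the server
# offers more decades with the same naming pattern
spans = ['2081_2090', '2091_2100']

datasets = [
    xr.open_dataset(folder + span + '.nc', decode_times=False)
      .sel(lat=32.298583, lon=-97.78538710, method='nearest')
    for span in spans
]

ds_all = xr.concat(datasets, dim='time', join='override',
                   data_vars='minimal', coords='minimal',
                   compat='override')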
Tyler
  • This is a comment and not an answer. This is not best practice; keep in mind that some people search for questions with no answers, and they won't be able to find your post anymore. – mosc9575 Jun 07 '21 at 12:38