
I would like to retrieve GOES-16 ABI radiance data for predetermined locations (about 10,000 points per individual image) for an entire year. Each day has ~100 individual images. I have all required ABI data (in netCDF format) on disk already. The points I'd like to extract are given in terms of the row and column of the netCDF array, so in principle, retrieving the correct radiances is an array indexing operation.

However, all my attempts at doing this have been painfully slow (on the order of 10+ minutes for a single day). I've been trying to use xarray, as follows.

import xarray as xr, pandas as pd

df = pd.read_csv("selected_pixels/20190101.csv")
ds = xr.open_mfdataset('noaa-goes16/ABI-L2-MCMIPF/2019/001/*/*.nc', parallel=True,
                       combine='nested', concat_dim='t')

t = xr.DataArray(df.time_id.values, dims="s")
x = xr.DataArray(df.col.values, dims="s")
y = xr.DataArray(df.row.values, dims="s")

a_df = ds[[f"CMI_C{i:02d}" for i in range(1, 17)]].isel(t=t, x=x, y=y).to_dataframe()

I'm fortunate enough to have multiple processors at my disposal: I would highly appreciate any suggestions to speed up this operation.
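As context for why this should be fast in principle: extracting predetermined (row, column) points from an in-memory array is a single vectorized fancy-indexing operation in NumPy. A minimal self-contained sketch (synthetic data and shapes, purely illustrative) of that operation:

```python
import numpy as np

# Synthetic stand-in for one band of one ABI image: a (y, x) radiance array.
# Real MCMIPF full-disk arrays are much larger; the shape here is illustrative only.
image = np.arange(100.0).reshape(10, 10)

# Row/column indices of the points to extract (as in the df.row / df.col columns).
rows = np.array([0, 3, 7])
cols = np.array([2, 2, 9])

# Fancy indexing pulls all points in one vectorized operation:
# one value per (row, col) pair.
values = image[rows, cols]
print(values)  # → [ 2. 32. 79.]
```

The question is therefore really about why the xarray/`isel` path is so much slower than this raw indexing baseline.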

  • Would https://stackoverflow.com/questions/29135885/netcdf4-extract-for-subset-of-lat-lon#29136166 work? – Jim D Apr 26 '21 at 10:55
  • Thank you for your response. Are you specifically referring to the one answer from 'Favo' or to another? – flymetothemoon200 Apr 26 '21 at 11:06
  • I was thinking favo. The best solution is to not read the entire data file into memory, because you are only interested in a subset. Reading a large file into memory and then discarding most of it wastes time. A library that lets you selectively grab parts while reading is much better. I'm not sure if favo's solution avoids reading the entire file. – Jim D Apr 26 '21 at 14:36
  • I am curious why you are processing an entire day, yet are extracting specific times by index. What happens if you remove the t=t from the isel and extract the entire day for each lat & lon ? – Robert Davy Apr 27 '21 at 04:05
  • These performance questions can be a little difficult to debug without seeing the data. Due to xarray's lazy evaluation, it could be that it's actually the `open_mfdataset` call that's really slow. You can easily test this by, e.g., seeing if you can plot the data for a single time step (`ds["CMI_C01"].isel(time=0).plot.imshow()`). Secondly, I've noticed that `isel` can be slow as well. You can always fall back on the underlying dask/numpy array via the `.data` attribute: `ds["CMI_C01"].data[t, x, y]`, since you're already using `isel` anyway (but check the dimension order!). – Huite Bootsma May 02 '21 at 22:35
  • Thanks all for your suggestions. Indeed @HuiteBootsma, `isel` appears to be quite slow. Using your suggestion already leads to a 2 orders of magnitude speed-up! – flymetothemoon200 Jun 17 '21 at 11:58
  • Have you tried [tag:satpy]? It uses dask for processing, so it might be faster. – gerrit Jan 14 '22 at 08:37

0 Answers