
I'm using nc_open to get a DatasetNode from a THREDDS Data Server, and reading a subset of the data in ncvar_get by specifying start and count. Reproducible example below:

library(thredds)
library(ncdf4)

Top <- CatalogNode$new("https://oceanwatch.pifsc.noaa.gov/thredds/catalog.xml") 
DD <- Top$get_datasets() 
dnames <- names(DD)
dname <- dnames[4] # "Chlorophyll a Concentration, Aqua MODIS - Monthly, 2002-present. v.2018.0"   
D <- DD[[dname]]

dl_url <- file.path("https://oceanwatch.pifsc.noaa.gov/thredds/dodsC", D$url)
dataset <- nc_open(dl_url)

dataset_lon <- ncvar_get(dataset, "lon") # Get longitude values
dataset_lat <- ncvar_get(dataset, "lat")  # Get latitude values
dataset_time <- ncvar_get(dataset, "time") # get time values in tidy format

# specify lon/lat boundaries for data subset:
lonmin = 160
lonmax = 161
latmin = -1
latmax = 0

LonIdx <- which(dataset_lon >= lonmin & dataset_lon <= lonmax)
LatIdx <- which(dataset_lat >= latmin & dataset_lat <= latmax)

# read the data for first 10 timesteps:
dataset_array <- ncvar_get(dataset, varid="chlor_a",
  start=c(findInterval(lonmin, dataset_lon), findInterval(latmax, sort(dataset_lat)), 1),
  count=c(length(LonIdx), length(LatIdx), 10), verbose=TRUE)


Is there a way to calculate the approximate download size of the ncvar_get subset before reading the data?

  • This seems impossible. Could you get this information if the data were on disk, without accessing the data? If you can't do it with on-disk data I can't see how it could be done with OPeNDAP – Robert Wilson Nov 19 '22 at 15:58
  • You're talking about doing this after opening the dataset but before loading the variable data, though, right? So the on-disk equivalent would be after opening a dataset but before loading. If you know the dimensions and type of the data, then certainly! I'm familiar with Python, not R, but I'm sure you can look up the byte lengths of the various types. In Python, float64 is 8 bytes (each byte is 8 bits; 8 * 8 = 64 bits). float32 is - you guessed it - 4 bytes. 77 million float64s = 77e6 * 8 = 616e6 bytes. Divide by 1024^2 ≈ 587 MB – Michael Delgado Nov 19 '22 at 18:19
  • @MichaelDelgado I didn't point out in my comment that this appears to be marine data, which is guaranteed to have missing values, so any such workflow would need to know how many missing values are in the cells. I don't see how you could calculate that without loading the data – Robert Wilson Nov 21 '22 at 08:12
  • You're totally right. @marine-ecologist it very much depends on whether you're actually talking about the file size on disk, which might be compressed, drop NaNs, etc., vs. the size in memory. I've given the size in memory, which is an upper bound of what it could be on disk. – Michael Delgado Nov 21 '22 at 17:24
  • The OP referred to "file size", though it is ambiguous. I don't use R much these days, but my understanding of ncvar_get is that it will save the file as a temp before loading in this case, a bit like CDO/NCO. So the download size seems impossible to know in advance, but the memory used by R should be easy to calculate, as @MichaelDelgado says – Robert Wilson Nov 22 '22 at 10:47
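The byte arithmetic from Michael Delgado's comment above can be sketched directly in R (a minimal illustration; the 77-million element count is just the example figure from the comment):

```r
# Estimated in-memory size of n floating-point values.
# float64 (R's numeric/double) is 8 bytes per value; float32 is 4 bytes.
n_values <- 77e6                      # example element count from the comment
bytes_float64 <- n_values * 8         # 77e6 * 8 = 616e6 bytes
mb_float64 <- bytes_float64 / 1024^2  # bytes -> mebibytes
round(mb_float64)                     # ~587 MB
```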

1 Answer


Many thanks to both @michael-delgado and @robert-wilson for the above. I've edited the original post to include a reproducible example and answered my own question in case it helps anyone else later down the line.

If I understand correctly, the chlor_a variable in this dataset is stored as float32 (4 bytes per value). Using the example Aqua MODIS Chlorophyll dataset in the post above:

An upper bound on the download size (assuming no NA values are dropped) before downloading the data with ncvar_get would be 23,040 bytes:

(length(LonIdx) * length(LatIdx) * 10) * 4 # based on 10 time steps 

which is confirmed by the dimensions of the data after downloading:

prod(dim(dataset_array)) * 4 # number of cells * 4 bytes per float32 value
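The same calculation can be wrapped in a small reusable helper (a sketch of my own; the function name is made up, and the 24 x 24 x 10 counts below are inferred from the 5,760-cell subset in this example):

```r
# Upper bound, in bytes, for a subset read: the product of the `count`
# vector passed to ncvar_get times the bytes per stored value
# (4 for float32, 8 for float64). Assumes no missing values are dropped.
estimate_subset_bytes <- function(count, bytes_per_value = 4) {
  prod(count) * bytes_per_value
}

# The example subset: 24 lon x 24 lat x 10 time steps = 5760 cells
estimate_subset_bytes(c(24, 24, 10))  # 23040 bytes
```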

Writing the output array to disk produces a 20,444 byte file:

dataset_output <-  as.data.frame.table(dataset_array)
saveRDS(dataset_output, "dataset_output.rds")

which is close to the calculated upper limit (23,040 bytes). For me this approach is useful for obtaining an upper limit and an approximate size before downloading the data with ncvar_get - many thanks to both of you.
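One caveat worth noting (my addition, not from the answers above): the 4-byte figure estimates the stored/transferred float32 size, but ncvar_get returns an R numeric array, which R stores as float64 (8 bytes per value), so the in-memory footprint is roughly double the download estimate. object.size shows this on an array of the same shape:

```r
# An array the same shape as the example subset (24 x 24 x 10 = 5760 cells).
# R numeric arrays are float64, so expect roughly 5760 * 8 = 46,080 bytes
# plus a small amount of object overhead.
x <- array(0, dim = c(24, 24, 10))
object.size(x)  # slightly over 46080 bytes
```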

(Out of interest, excluding NA in the above example leaves 4,559 of the 5,760 cells; sum(!is.na(dataset_array)) * 4 gives 18,236 bytes, smaller than the actual file size (20,444 bytes).)