
I'm downloading daily 600 MB NetCDF-4 files that have this structure:

netcdf myfile {
dimensions:
        time_counter = 18 ;
        depth = 50 ;
        latitude = 361 ;
        longitude = 601 ;
variables:
        salinity
        temp, etc     

I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.

I found a way of doing it with the NetCDF command-line tools and sed, like this:

ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#"  | ncgen -o myfileunlimited.nc

This worked for me for small files, but dumping a 600 MB NetCDF file takes too much memory and time.

Does anybody know another method for accomplishing this?

Favo
  • For the substitutions you are doing, `sed` and the chain of pipes are about as efficient as it gets, unless you go with a `hadoop`-type solution that will break the file up into parts, send the parts to multiple servers, perform the operation and then "glue" the files back together. I don't see how memory can be an issue; `sed` processes one line at a time. I don't know anything about the `nc` suite of tools, so maybe there is some option that will make `ncgen` run more efficiently? (probably not) Is your computer undersized for this task? Time to make the boss buy new! Yeah!! Good luck! – shellter Feb 19 '15 at 04:34
  • How much time and memory is "too much"? If memory use scales linearly, you shouldn't need much more than a few dozen gigabytes, I reckon. – Lars Viklund Feb 19 '15 at 05:07

3 Answers


Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB NetCDF file takes almost 5 times more space as a text file (CDL representation). Modifying some header text and then generating the NetCDF file again doesn't feel very efficient.

I read the latest NCO documentation and found an option specific to ncks, "--mk_rec_dmn". Ncks mainly extracts and writes or appends data to a new NetCDF file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which "--mk_rec_dmn" does, then replace the old file.

ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc 

The opposite operation (record dimension to fixed size) would be:

ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
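
For completeness, the result can be double-checked from Python with the netCDF4 module (a minimal sketch, assuming that module is installed; the file and dimension names are the ones from the question):

from netCDF4 import Dataset

# Open the converted file read-only and inspect the record dimension.
with Dataset('myfile.nc') as nc:
    dim = nc.dimensions['time_counter']
    print(dim.isunlimited())  # should be True after --mk_rec_dmn
    print(len(dim))           # still 18 records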
Favo
  • It's acceptable on the site to accept your own answer if it answers the question. Nice find. – Lars Viklund Feb 19 '15 at 23:26
  • Favo, you have found the right method. Doing it with NCO reduces memory consumption considerably over the sed approach. Only one variable is held in memory at a time. The ncdump/sed method does not scale. Plus NCO records the metadata change in the "history" attribute, so downstream users know what you did. – Charlie Zender Mar 10 '15 at 11:36
  • You do not need `-o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc`, just say `-O -o myfile.nc` (-O is optional, for overwriting the original file without asking). – Kostas Aug 29 '20 at 21:01

The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.

The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.

Since the route through dump+gen is about as fast as it is going to get, that leaves using NetCDF functionality to do the conversion of your data files.

If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
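
As a rough illustration of what "write them yourself" could look like, here is a minimal sketch using the netCDF4 Python bindings (an assumption on my part that Python is an option; the dimension name comes from the question). It copies dimensions, attributes and variables, declaring the chosen dimension as unlimited, and copies record variables one record at a time to keep memory use low:

from netCDF4 import Dataset  # assumes the netCDF4-python package is installed

def copy_with_unlimited(src_path, dst_path, rec_dim='time_counter'):
    """Copy a NetCDF file, declaring rec_dim as UNLIMITED in the copy."""
    with Dataset(src_path) as src, Dataset(dst_path, 'w', format='NETCDF4') as dst:
        src.set_auto_maskandscale(False)  # copy raw (packed) values straight through

        # Global attributes.
        dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})

        # Dimensions: a size of None means UNLIMITED.
        for name, dim in src.dimensions.items():
            dst.createDimension(name, None if name == rec_dim else len(dim))

        # Variables: recreate each one, copy its attributes, then its data.
        for name, var in src.variables.items():
            fill = getattr(var, '_FillValue', None)
            out = dst.createVariable(name, var.dtype, var.dimensions, fill_value=fill)
            out.set_auto_maskandscale(False)
            out.setncatts({k: var.getncattr(k) for k in var.ncattrs() if k != '_FillValue'})
            if var.dimensions and var.dimensions[0] == rec_dim:
                # Copy one record at a time to keep memory use low
                # (assumes the record dimension comes first, as it usually does).
                for i in range(var.shape[0]):
                    out[i] = var[i]
            else:
                out[...] = var[...]

copy_with_unlimited('myfile.nc', 'myfileunlimited.nc')

Note that this sketch does not carry over chunking or compression settings, so a dedicated tool such as ncks is usually the safer route.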

If you're extremely unlucky, you'll have to go below the NetCDF API: NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).

It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:

"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."

As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.

Any decent disk should read/sink your files in 10 seconds or so; memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.

If concurrent memory use is a big concern, you can trade disk space for memory by saving the intermediate result from sed to disk, which will likely take up 1.5 gigabytes or so.

Lars Viklund

You can use the xarray Python package's Dataset.to_netcdf() method, then optimise memory usage with Dask.

You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument, and use chunks to split the data. For instance:

import xarray as xr

# Open lazily, with Dask chunks along time_counter.
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})

# Write out, declaring time_counter as an unlimited (record) dimension.
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter': True})

There is a nice summary of combining Dask and xarray linked here.
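
As a quick sanity check (a small sketch, assuming the same file names as above), re-opening the result should report time_counter among the unlimited dimensions recorded in the dataset's encoding:

import xarray as xr

# Re-open the converted file and see which dimensions were written as unlimited.
out = xr.open_dataset('myfileunlimited.nc')
print(out.encoding.get('unlimited_dims'))  # expected to include 'time_counter'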

tsherwen