
I have a folder with NetCDF files from 2006-2100, in ten-year blocks (2011-2020, 2021-2030, etc.).

I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:

ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.

Then merged these like this:

dsmerged = xarray.merge([ds,ds1])

This works, but it is clunky, and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do it?

EDIT:

Trying to join these files using glob:

for filename in glob.glob('path/to/file/*.nc'):
    dsmerged = xarray.merge([filename])

Gives the error:

AttributeError: 'str' object has no attribute 'items'

This reads only the text of the filename, not the actual file, so it can't be merged. How do I open each file, store it as a variable, and then merge them all without doing it bit by bit?

  • How about `dsmerged = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])`? – Abdou Nov 14 '17 at 16:29
  • OK, that almost made my computer implode, and after un-crashing it said `MemoryError` - this might be due to the size of the files? Perhaps my computer can't handle this? – Pad Nov 14 '17 at 17:00
  • You have more files than your machine's memory capacity can handle. You can test whether the code I provided truly works by reducing the number of files to process, as follows: `dsmerged = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')[:2]])`. In this case, you are only processing two files. As for your memory issues, I would advise looking at [this](http://xarray.pydata.org/en/stable/dask.html#dask). – Abdou Nov 14 '17 at 17:05
  • I tried it with fewer files, and it works! Thank you. I will also try to sort out the memory issues as you suggest. – Pad Nov 14 '17 at 17:07
  • Hmm, now I hit more issues using dask: `ValueError: Chunks do not align set([(60, 120, 120), (60, 240)])` – Pad Nov 14 '17 at 17:20
  • How did you get that error? The error is related to setting chunk sizes and those depend on the `dask` package and can be quite complicated to deal with. What happens if you don't set any chunk sizes while importing the data? – Abdou Nov 14 '17 at 18:40
  • Sorry for the delay - the error came from using the command `dsmerged = xarray.merge([xarray.open_mfdataset(f) for f in glob.glob('path/to/file/*.nc')])` - so this is without setting any chunk sizes. When specifying chunk sizes I get the same error but with different numbers in parentheses. – Pad Nov 15 '17 at 14:14
  • If you are using `xarray.open_mfdataset`, you don't need the `xarray.merge` operation. It's already being handled by `xarray.open_mfdataset`. Just `dsmerged = xarray.open_mfdataset('path/to/file/*.nc')` should suffice. – Abdou Nov 15 '17 at 14:17
  • That ran almost instantly. Thank you so much and apologies for missing the point on a number of occasions! – Pad Nov 15 '17 at 14:21
  • I am glad that helped. Please feel free to accept the provided answer whenever you can. – Abdou Nov 15 '17 at 14:39

1 Answer


If you are looking for a clean way to get all your datasets merged together, you can use a list comprehension together with the xarray.merge function. The following is an illustration:

ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])
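
As a rough sketch of the same approach, the comprehension can also be spelled out as an explicit loop, which makes it easy to close the source files once you are done with the merged result:

import glob
import xarray

# Open each file first, then merge the resulting Dataset objects
# (passing bare filename strings to xarray.merge is what caused the
# AttributeError in the question).
datasets = []
for filename in sorted(glob.glob('path/to/file/*.nc')):
    datasets.append(xarray.open_dataset(filename))

dsmerged = xarray.merge(datasets)

# Close the source files once the merged result is no longer needed.
for ds in datasets:
    ds.close()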

In response to the out-of-memory error you encountered, that is most likely because those files hold more data than will fit in memory at once. The best fix is the xarray.open_mfdataset function, which uses the dask library under the hood to break the data into smaller chunks for processing. This is usually more memory efficient and will often allow you to bring your data into Python. With this function, you do not need a for-loop; you can just pass it a string glob in the form "path/to/my/files/*.nc". The following is equivalent to the previous solution, but more memory efficient:

ds = xarray.open_mfdataset('path/to/file/*.nc')
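
Since the original goal is a single new NetCDF file on disk, the result of either approach can be written out with the Dataset's to_netcdf method. A minimal sketch, where the output filename is only a placeholder; with open_mfdataset the data stay dask-backed and are only computed when the file is written:

import xarray

# Open all the files lazily, then write one combined NetCDF file.
# 'combined_2006_2100.nc' is just a placeholder output name.
ds = xarray.open_mfdataset('path/to/file/*.nc')
ds.to_netcdf('path/to/file/combined_2006_2100.nc')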

I hope this proves useful.

  • This question has been useful to so many people - thanks again! For anyone reading, the `open_mfdataset` command has been the best solution for me many times over the years. Very helpful! – Pad Apr 27 '22 at 10:02