
I'm reading NetCDF files with open_mfdataset, and they contain duplicate times. For each duplicated time I only want to keep the first occurrence and drop the second (a time never occurs more than twice). The problem is quite similar to this Pandas question, but none of the solutions provided there seem to work with xarray.

To reproduce the problem:

import numpy as np
import netCDF4 as nc4
import xarray as xr

# Create example NetCDF files
for t in range(2):
    nc    = nc4.Dataset('test{}.nc'.format(t), 'w')
    dim_t = nc.createDimension('time', None)
    var_t = nc.createVariable('time', 'f8', ('time',))
    var_s = nc.createVariable('var', 'f8', ('time',))
    var_t.setncattr('units', 'hours since 2001-01-01 00:00:00')
    var_t[:] = t*5+np.arange(6)
    var_s[:] = t*5+np.arange(6)+t
    nc.close()

# Read with xarray
f = xr.open_mfdataset(['test0.nc', 'test1.nc'])

The times in the resulting dataset are:

array(['2001-01-01T00:00:00.000000000', '2001-01-01T01:00:00.000000000',
       '2001-01-01T02:00:00.000000000', '2001-01-01T03:00:00.000000000',
       '2001-01-01T04:00:00.000000000', '2001-01-01T05:00:00.000000000',
       '2001-01-01T05:00:00.000000000', '2001-01-01T06:00:00.000000000',
       '2001-01-01T07:00:00.000000000', '2001-01-01T08:00:00.000000000',
       '2001-01-01T09:00:00.000000000', '2001-01-01T10:00:00.000000000'], dtype='datetime64[ns]')

Is there an easy way to remove the second occurrence of 2001-01-01T05:00:00.000000000? The real-life problem involves multi-dimensional NetCDF files, so switching to Pandas is not an option.

[update] The closest I have come is by following this answer; that works for my simple example as long as Dask is not used. If the files contain Dask arrays, I get the error:

'last' with skipna=True is not yet implemented on dask arrays

But I don't see where I can (or have to) set skipna.
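For reference, a sketch of where that keyword might go, assuming the linked answer uses a GroupBy reduction such as first or last (xarray's GroupBy reductions accept a skipna argument; whether skipna=False actually avoids the Dask limitation is untested here):

# Hypothetical: pass skipna explicitly to the GroupBy reduction.
# It is unverified whether this sidesteps the dask error above.
f_first = f.groupby('time').first(skipna=False)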

Bart

2 Answers


I think xarray does not have its own method for this purpose, but the following works:

In [7]: _, index = np.unique(f['time'], return_index=True)

In [8]: index
Out[8]: array([ 0,  1,  2,  3,  4,  5,  7,  8,  9, 10, 11])

In [9]: f.isel(time=index)
Out[9]: 
<xarray.Dataset>
Dimensions:  (time: 11)
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-01-01T01:00:00 ...
Data variables:
   var      (time) float64 dask.array<shape=(11,), chunksize=(6,)>
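To make this reusable, one could wrap it in a small helper (a sketch; the function name is my own). Note that np.unique also sorts the coordinate, which is a no-op here because the times are already monotonic:

import numpy as np

def drop_duplicate_times(ds, dim='time'):
    # np.unique returns the index of the first occurrence of each
    # value, so this keeps the first of every duplicated time.
    _, index = np.unique(ds[dim], return_index=True)
    return ds.isel({dim: index})

f_unique = drop_duplicate_times(f)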
Keisuke FUJII
  • Thanks so much for this solution. I had exactly the same problem as @bart above, with duplicate time entries from `xr.open_mfdataset`, and this solution solved it perfectly. – drg Oct 10 '18 at 23:56

Apparently Stack Overflow won't let me comment, so I'm adding to Keisuke's answer here: you can also use the get_index() method to get a pandas index.

f.sel(time=~f.get_index("time").duplicated())
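If sel raises an InvalidIndexError on the non-unique index (as a commenter reports below), a positional variant with isel should work instead, since it avoids label-based reindexing (a sketch):

# duplicated() marks all but the first occurrence by default
# (keep='first'); isel takes the boolean mask positionally.
f.isel(time=~f.get_index("time").duplicated())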
Mark Boer
  • Does not work. Returns `pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects`. – Biggsy Jul 12 '21 at 10:43
  • Works fine for me, but what does `~` do here? – Abel Jul 28 '21 at 09:36
  • The `~` is the invert operator in numpy (and xarray) (https://numpy.org/doc/stable/reference/generated/numpy.invert.html). Used on a boolean array, such as the one returned by the duplicated function, it is shorthand for a `not` operation. You could also use `np.logical_not` instead of `~` if you wish. You cannot use `~` as a replacement for `np.logical_not` on other array types, so be careful ;-) – Mark Boer Jul 29 '21 at 10:16