1

I have a fairly large amount of climate data stored in netCDF files. Unfortunately, things sometimes go wrong and part of the data on our supercomputer is lost. The problem is that I have to find all timesteps for which the data is missing.

First I read the time variable from all files with xarray and convert it to a list (list1). In the second step I create a list with all timesteps that should be there (list2). Now I want all elements that are in list2 but not in list1.

import datetime as dt
from dateutil.relativedelta import relativedelta
import numpy as np
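# create artificial data: 6-hourly timestamps starting in 1850
startdate = dt.datetime(1850, 1, 1, 6, 0, 0)
enddate = dt.datetime(2001, 1, 1, 6, 0, 0)
deltatime = relativedelta(hours=6)
list1 = []
list2 = []
# list1: the condition is checked before date is updated,
# so the last element is one step past enddate
i = 0
date = startdate
while date <= enddate:
    date = startdate + i*deltatime
    list1.append(np.datetime64(date))
    i += 1
# list2: same series, but it stops at enddate (one element fewer than list1)
i = 0
date = startdate
while date < enddate:
    date = startdate + i*deltatime
    list2.append(np.datetime64(date))
    i += 1
starttime = dt.datetime.now()
# get reduced list: remove every element of list2 from list1
# (list.remove does a linear search, so this loop is quadratic overall)
for i in list2:
    list1.remove(i)
endtime = dt.datetime.now()
delta = endtime - starttime
print(delta)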

The code does exactly what I want. In this simple example it just returns the last date of list1. My question: is there a way to get better performance for larger lists?

Elim Garak
  • @Chris_Rands I don't think this is a duplicate of the linked question. Because we are dealing with `datetime` objects here, if the lists are huge, a set difference can be inefficient. Another possible approach is to construct `list2` on-the-fly by checking each newly read input if it is in `list1` using binary search. Packing these series in an indexing structure such as Pandas `DatetimeIndex` or using functions native to Pandas for dealing with timeseries can possibly speed things up. – lightalchemist Aug 14 '19 at 13:04
  • I just checked with np.setdiff1d(list1,list2): It is also faster with datetime objects. And it is fast enough even in my worst case scenario. Now I just have to think about if I really can use sets or if there is a possibility that I have the same datetime in one list twice. – Elim Garak Aug 15 '19 at 13:11
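
The binary-search idea from the first comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `missing_timesteps` and the small made-up series are hypothetical, and the list of timesteps that are present must already be sorted in ascending order.

```python
import datetime as dt
from bisect import bisect_left


def missing_timesteps(expected, present_sorted):
    """Return the items of `expected` that do not occur in `present_sorted`.

    Assumption: `present_sorted` is sorted ascending, so each lookup is a
    binary search instead of a linear scan.
    """
    missing = []
    for t in expected:
        pos = bisect_left(present_sorted, t)
        # t is present only if the insertion point holds an equal element
        if pos == len(present_sorted) or present_sorted[pos] != t:
            missing.append(t)
    return missing


# Hypothetical example data: eight 6-hourly timesteps, two of them "lost"
start = dt.datetime(1850, 1, 1, 6)
step = dt.timedelta(hours=6)
expected = [start + i * step for i in range(8)]
present = [t for i, t in enumerate(expected) if i not in (2, 5)]

missing = missing_timesteps(expected, present)
print(missing)
```

Unlike a set difference, this keeps the order of `expected` and does not drop duplicates in it.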

2 Answers

4

I really like set analysis, where you can do:

set(list2) - set(list1)

Converting a list to a set discards duplicates and ordering. The - operator then gives the set difference: a - b contains the elements of a that are not in b, so the order of the operands matters.
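
To make the direction of the difference concrete, here is a minimal sketch with a few made-up timesteps. The names `should_exist` and `on_disk` are hypothetical stand-ins for the two lists in the question.

```python
import datetime as dt

# Hypothetical example data: six 6-hourly timesteps that should exist
start = dt.datetime(1850, 1, 1, 6)
step = dt.timedelta(hours=6)
should_exist = [start + i * step for i in range(6)]

# Pretend the fourth timestep was lost and one timestep was read twice
on_disk = [t for i, t in enumerate(should_exist) if i != 3]
on_disk.append(on_disk[0])

# Expected but not found; sorted() restores chronological order
missing = sorted(set(should_exist) - set(on_disk))

# The reverse direction: found but not expected (empty here)
unexpected = set(on_disk) - set(should_exist)

print(missing)
print(unexpected)
```

The duplicate in `on_disk` has no effect on either result, because both lists are reduced to sets first.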

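If the lists are enormous, numpy can be a bit faster. np.setdiff1d(a, b) returns the sorted, unique values of a that are not in b, so the call below gives the values in list1 that are not in list2, like the loop in the question.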

import numpy as np
np.setdiff1d(list1, list2)
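
A small self-contained sketch of the same call on `datetime64` values, using made-up data; the names `expected` and `present` are hypothetical.

```python
import numpy as np

# Hypothetical example data: eight 6-hourly timesteps as datetime64 values
expected = np.arange(
    np.datetime64("1850-01-01T06"),
    np.datetime64("1850-01-03T06"),
    np.timedelta64(6, "h"),
)

# Pretend the timesteps at positions 2 and 5 were lost
present = np.delete(expected, [2, 5])

# Sorted, unique values of `expected` that are not in `present`
missing = np.setdiff1d(expected, present)
print(missing)
```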
Laurens Koppenol
2

Try:

list(set(list1) - set(list2))
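
If the same datetime can occur more than once in a list, a set difference silently drops the repeats. One possible alternative, swapped in here and not part of the answer above, is a multiset difference with `collections.Counter`, which removes one occurrence per element the way the `list.remove` loop in the question does. A minimal sketch with made-up data:

```python
import datetime as dt
from collections import Counter

# Hypothetical example data: five timesteps, one of them read twice
start = dt.datetime(1850, 1, 1, 6)
step = dt.timedelta(hours=6)
list1 = [start + i * step for i in range(5)]
list1.append(list1[1])   # duplicate entry
list2 = list1[:4]        # the first four timesteps, once each

# Counter subtraction keeps only positive counts, so one copy of the
# duplicated timestep survives alongside the last timestep
remaining = sorted((Counter(list1) - Counter(list2)).elements())

# The plain set difference loses the surviving duplicate
as_set = set(list1) - set(list2)

print(remaining)
print(as_set)
```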
Kostas Charitidis