I have quite a large amount of climate data stored in netCDF files. Unfortunately, things sometimes go wrong and parts of the data on our supercomputer are lost. The problem is that I have to find all the timesteps for which data is missing.
First I read the time variable from all files with xarray and convert it to a list (list1). In a second step, I create a list of all the timesteps that should be there (list2). Now I want all elements that are in list2 but not in list1.
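For context, list1 is built roughly like this (the file pattern is just a placeholder, not my actual paths):

import xarray as xr

# hypothetical file pattern standing in for the real model output
ds = xr.open_mfdataset("/path/to/output/*.nc", combine="by_coords")
list1 = list(ds["time"].values)  # list of numpy datetime64 values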
import datetime as dt
from dateutil.relativedelta import relativedelta
import numpy as np

# create artificial data
startdate = dt.datetime(1850, 1, 1, 6, 0, 0)
enddate = dt.datetime(2001, 1, 1, 6, 0, 0)
deltatime = relativedelta(hours=6)

# list1: the timesteps actually present (uses <=, so it ends
# one step after enddate)
date = startdate
list1 = []
list2 = []
i = 0
while date <= enddate:
    date = startdate + i * deltatime
    list1.append(np.datetime64(date))
    i += 1

# list2: the timesteps that should be there (uses <, so it ends
# at enddate)
i = 0
date = startdate
while date < enddate:
    date = startdate + i * deltatime
    list2.append(np.datetime64(date))
    i += 1

starttime = dt.datetime.now()
# get reduced list: remove every element of list2 from list1
for timestep in list2:
    list1.remove(timestep)
endtime = dt.datetime.now()
delta = endtime - starttime
print(delta)
The code does exactly what I want. In this simple example, it just returns the last date of list1. My question: is there a way to get better performance for larger lists?
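The only alternative I have thought of so far is converting both lists to sets; a minimal sketch of the same reduction (sets do not preserve order, so I sort the result):

missing = sorted(set(list1) - set(list2))
print(missing)

Would that be the right approach, or is there something better suited to long lists of np.datetime64 values, such as numpy's setdiff1d?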