Python: Fastest way to find all elements in one large list but not in another

Question

I have a fairly large amount of climate data stored in netCDF files. Unfortunately, things sometimes go wrong and parts of the data on our supercomputer are lost. I then have to find all timesteps for which the data is missing.

First I read the time variable from all files with xarray and convert it to a list (list1). In the second step I create a list with all timesteps that should be there (list2). Now I want all elements that are in list2 but not in list1.

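import datetime as dt
from dateutil.relativedelta import relativedelta
import numpy as np

# create artificial data: 6-hourly timesteps
startdate = dt.datetime(1850, 1, 1, 6, 0, 0)
enddate = dt.datetime(2001, 1, 1, 6, 0, 0)
deltatime = relativedelta(hours=6)

# list1: the loop condition is checked before `date` is advanced,
# so list1 ends one step after enddate
list1 = []
date = startdate
i = 0
while date <= enddate:
    date = startdate + i * deltatime
    list1.append(np.datetime64(date))
    i += 1

# list2: same construction, but it ends exactly at enddate
list2 = []
date = startdate
i = 0
while date < enddate:
    date = startdate + i * deltatime
    list2.append(np.datetime64(date))
    i += 1

# get reduced list: remove every element of list2 from list1 (in place),
# which leaves the elements of list1 that are not in list2
starttime = dt.datetime.now()
for item in list2:
    list1.remove(item)
endtime = dt.datetime.now()
delta = endtime - starttime
print(delta)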

The code does exactly what I want. In this simple example it leaves only the last date of list1. My question: is there a way to get better performance for larger lists?

@Chris_Rands I don't think this is a duplicate of the linked question. Because we are dealing with `datetime` objects here, if the lists are huge, a set difference can be inefficient. Another possible approach is to construct `list2` on-the-fly by checking each newly read input if it is in `list1` using binary search. Packing these series in an indexing structure such as Pandas `DatetimeIndex` or using functions native to Pandas for dealing with timeseries can possibly speed things up. — lightalchemist, Aug 14 '19 at 13:04
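The binary-search idea from the comment above could look roughly like the following. This is only a sketch under stated assumptions: the list of known timesteps is already sorted in ascending order, and the helper name `contains_sorted` and the small sample data are made up for illustration.

```python
import bisect
import datetime as dt


def contains_sorted(sorted_items, value):
    """Return True if value occurs in the ascending list sorted_items."""
    pos = bisect.bisect_left(sorted_items, value)
    return pos < len(sorted_items) and sorted_items[pos] == value


# Hypothetical sample data: five 6-hourly timesteps, the third one is lost.
start = dt.datetime(2000, 1, 1, 6)
step = dt.timedelta(hours=6)
present = [start + i * step for i in (0, 1, 3, 4)]  # sorted, index 2 missing
candidates = [start + i * step for i in range(5)]   # what should be there

# Each lookup costs O(log n) instead of the O(n) scan done by list.remove().
missing = [t for t in candidates if not contains_sorted(present, t)]
print(missing)
```

Compared with a set, this avoids building a second data structure, but it only works if the list of known timesteps is kept sorted.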
I just checked with np.setdiff1d(list1,list2): It is also faster with datetime objects. And it is fast enough even in my worst case scenario. Now I just have to think about if I really can use sets or if there is a possibility that I have the same datetime in one list twice. — Elim Garak, Aug 15 '19 at 13:11
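On the duplicate concern raised in the comment above: sets and `np.setdiff1d` both collapse repeated values. If repeated timesteps need to be counted rather than collapsed, one possible approach is a multiset difference with `collections.Counter`. The sample lists below are made up for illustration; the names `list1` and `list2` only mirror the question.

```python
from collections import Counter
import datetime as dt

# Hypothetical sample data: one timestep occurs twice in list1.
start = dt.datetime(2000, 1, 1, 6)
step = dt.timedelta(hours=6)
list1 = [start, start + step, start + step, start + 2 * step]
list2 = [start, start + step]

# Counter subtraction removes one occurrence per match and
# drops entries whose count falls to zero or below.
leftover = Counter(list1) - Counter(list2)
result = sorted(leftover.elements())
print(result)

# For comparison: a plain set difference ignores the duplicate entirely.
print(set(list1) - set(list2))
```

Here the Counter version still reports the second copy of the repeated timestep, while the set version does not.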

Laurens Koppenol · Accepted Answer · 2019-08-14T12:45:21.577

4

I really like set analysis, where you can do:

set(list2) - set(list1)

Putting list items in a set removes all duplicates and discards the ordering. The - operator then gives the set difference: here, every item that is in list2 but not in list1. Swap the operands to get the opposite direction.
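Because a set difference loses the original order, a variant that may be useful for timeseries is to build the set once and filter the list with it. This is a minimal sketch with made-up sample data; `expected` plays the role of list2 and `found` the role of list1.

```python
import datetime as dt

# Hypothetical sample data: eight 6-hourly timesteps, two of them lost.
start = dt.datetime(2000, 1, 1, 6)
step = dt.timedelta(hours=6)
expected = [start + i * step for i in range(8)]
found = [t for i, t in enumerate(expected) if i not in (2, 5)]

# Build the lookup set once; membership tests are O(1) on average.
found_set = set(found)

# The comprehension keeps the order of `expected`,
# which a plain set difference does not guarantee.
missing = [t for t in expected if t not in found_set]
print(missing)
```

Unlike the set difference, this also keeps repeated entries of `expected` if there are any.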

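If the lists are enormous, numpy can be a bit faster. Note that np.setdiff1d(list1, list2) returns the sorted, unique values of list1 that are not in list2, so the argument order is the reverse of the set expression above; swap the arguments if you need the other direction.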

import numpy as np
np.setdiff1d(list1, list2)
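A small self-contained check of the numpy variant with `datetime64` values might look like this. The arrays are made up for illustration, with `expected` standing in for the full set of timesteps and `found` for what was actually read from the files.

```python
import numpy as np

# Hypothetical sample data: eight 6-hourly timesteps as a datetime64 array.
start = np.datetime64("2000-01-01T06")
expected = start + np.arange(8) * np.timedelta64(6, "h")

# Drop two entries to mimic lost files.
found = np.delete(expected, [2, 5])

# Sorted, unique values of the first array that are not in the second.
missing = np.setdiff1d(expected, found)
print(missing)
```

Keeping the data in a numpy array from the start avoids converting large Python lists on every call.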

edited Aug 14 '19 at 12:45

answered Aug 14 '19 at 12:43

Laurens Koppenol

score 2 · Answer 2 · answered Aug 14 '19 at 12:44

Try:

list(set(list1) - set(list2))

answered Aug 14 '19 at 12:44

Kostas Charitidis
