Convert the list keep
into a set
, since that will be checked frequently. Iterate over train
, since you want to keep order and repeats. That makes set
not an option. Even if it was, it wouldn't help, since the iteration would have to happen anyway:
keeps = set(keep)
train_keep = [k for k in train if k in keeps]
A lazier, and probably slower version would be something like
train_keep = filter(lambda x: x in keeps, train)
Neither of these options will give you a large speedup you'd probably be better off using numpy or pandas or some other library that implements the loops in C and stores numbers as something simpler than full-blown python objects. Here is a sample numpy solution:
train = np.array([...])
keep = np.array([...])
train_keep = train[np.isin(train, keep)]
This is likely an O(M * N)
algorithm rather than O(M)
set lookup, but if checking N
elements in keep
is faster than a nominally O(1)
lookup, you win.
You can get something closer to O(M log(N))
using sorted lookup:
train = np.array([...])
keep = np.array([...])
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
A better alternative might be to append np.inf
or a maximum out-of-bounds integer to the sorted keep
array, so you don't have to distinguish missing from edge elements with extra
at all. Something like np.max(train.max() + 1, keep.max())
would do:
train = np.array([...])
keep = np.array([... 99999])
keep.sort()
ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]
For random inputs with train.size = 10000
and keep.size = 10
, the numpy method is ~10x faster on my laptop.