So you have two lists, list1 and list2. Your implementation checks 4000 times (the length of list1) whether a given element is in list2.
A first improvement would be to convert list2 to a set: membership tests on a set are O(1) on average instead of O(n) for a list (and duplicates are removed as a side effect, making it shorter too). As the example shows, this cuts the execution time from 102 ms to 2.42 ms.
In [388]: import random
In [389]: import numpy as np
In [390]: list1=sorted(random.randint(0,1000) for r in range(4000))
In [391]: list2=np.random.randint(2000, size=2000)
In [392]: %timeit np.array([1 if (feat in list2) else 0 for feat in list1])
10 loops, best of 3: 102 ms per loop
In [393]: list2=set(list2)
In [394]: %timeit np.array([1 if (feat in list2) else 0 for feat in list1])
100 loops, best of 3: 2.42 ms per loop
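A self-contained, copy-pasteable version of the set-conversion approach (the variable names and the fixed seed are illustrative, chosen here only for reproducibility):

```python
import numpy as np

# Build inputs similar to the ones in the transcript (illustrative seed).
rng = np.random.default_rng(0)
list1 = sorted(rng.integers(0, 1000, size=4000).tolist())
list2 = rng.integers(0, 2000, size=2000)

# Converting to a set gives O(1) average-case membership tests.
lookup = set(list2.tolist())
features = np.array([1 if feat in lookup else 0 for feat in list1])
```

The list comprehension itself is unchanged; only the container being searched is different.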
Another thing to do would be to avoid repeating lookups for the same element of list1, by saving in a dictionary the results for elements that have already been seen. This way you take care of duplicates in list1.
In [395]: def get_feature_list():
...: d=dict.fromkeys(list2,1)
...: feature_list2=[]
...: for f in list1:
...: if f not in d:
...: d[f]=int(f in list2)
...: feature_list2.append(d[f])
...: return np.array(feature_list2)
...:
In [396]: %timeit get_feature_list()
100 loops, best of 3: 5.46 ms per loop
Using an extra dictionary to store results doesn't bring any improvement though (5.46 ms vs 2.42 ms). This is partly the overhead of building the dictionary on every call, and partly that a set lookup is already O(1) on average, so there is little for the cache to save.
A variant of the function above that does its bookkeeping in a plain Python list actually makes things worse, and the gap widens for very large arrays, since membership tests on a list are O(n):
In [440]: def get_feature_list():
...: d=[]
...: feature_list2=[]
...: for f in list1:
...: if f not in list2:
...: if f not in d:
...: d.append(f)
...: feature_list2.append(0)
...: else:
...: feature_list2.append(1)
...: return np.array(feature_list2)
...:
In [441]: list1=sorted(random.randint(0,1000) for r in range(400000))
In [443]: %timeit np.array([1 if (feat in list2) else 0 for feat in list1])
1 loop, best of 3: 273 ms per loop
In [442]: %timeit get_feature_list()
1 loop, best of 3: 1.6 s per loop
Another thing that could be done would be to remove an element from list2 after it has been found, so that the size of list2 decreases at each step. Note, though, that since list1 contains duplicates, a removed element would be wrongly reported as missing on its later occurrences unless the results are also cached.
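A minimal sketch of that removal idea (the names are illustrative): a small cache is needed so that values already classified keep their result after being removed from the set.

```python
import numpy as np

def shrink_lookup(values, candidates):
    """Map each value to 1/0 membership, removing found elements so the
    candidate set shrinks. The cache handles repeated values, which would
    otherwise be misclassified once their element has been removed."""
    remaining = set(candidates)
    seen = {}
    out = []
    for v in values:
        if v not in seen:
            found = v in remaining
            seen[v] = int(found)
            if found:
                remaining.discard(v)
        out.append(seen[v])
    return np.array(out)
```

For example, shrink_lookup([1, 1, 2, 3], [1, 3, 5]) returns array([1, 1, 0, 1]) — the second 1 is still reported as found thanks to the cache.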
Right now I can't think of a way of exploiting the fact that list1 is sorted to improve efficiency.
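That said, one small thing sortedness does guarantee is that equal elements sit next to each other, so a sketch can at least reuse the previous result whenever the current element repeats the previous one (illustrative names; this doesn't change the overall O(n) cost, since set lookups are already O(1) on average):

```python
import numpy as np

def sorted_lookup(sorted_values, candidates):
    """Exploit sortedness: duplicates are adjacent, so each distinct
    value is tested against the set only once."""
    lookup = set(candidates)
    out = []
    prev = object()  # sentinel that compares unequal to any input value
    prev_result = 0
    for v in sorted_values:
        if v != prev:
            prev = v
            prev_result = 1 if v in lookup else 0
        out.append(prev_result)
    return np.array(out)
```

For example, sorted_lookup([1, 1, 2, 2, 3], [2]) returns array([0, 0, 1, 1, 0]), having consulted the set only three times.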