1

I have two lists :

  • List1 is 1x4000, a list of Ordered features (and order is crucial for the rest of my code)
  • List2 is of random length and is the list I need to extract the features from.

The required output is:

  • A numpy list of size 1x4000 of binaries, so that feature_list[i] is equal to 1 if the ith feature of List1 is present in List2, 0 otherwise.

So far I am using the below code:

feature_list = np.array([1 if (feat[1] in List2) else 0 for feat in List1])

But as you can imagine, this takes a very long time when called thousand times.

These threads Stackoverflow 1, Stackoverflow 2 recommend the use of a dict or a set, but I need to call this indexing thousands of times and it has to be in the same order everytime.

Any ideas?

Community
  • 1
  • 1
ylnor
  • 4,531
  • 2
  • 22
  • 39

1 Answers1

0

So you have two lists, list1 and list2. Your implementation checks 4000 times (the length of list1) if a given element is in list2.

A first improvement would be to make list2 shorter by converting it to a set (since this will remove duplicates). As you see in the example, in this case the execution time has been reduced from 102 ms to 2.42 ms.

In [390]: list1=sorted(random.randint(0,1000) for r in range(4000))

In [391]: list2=np.random.randint(2000, size=2000)

In [392]: %timeit np.array([1 if (feat in list2) else 0 for feat in list1])
10 loops, best of 3: 102 ms per loop

In [393]: list2=set(list2)

In [394]: %timeit np.array([1 if (feat in list2) else 0 for feat in list1])
100 loops, best of 3: 2.42 ms per loop

Another thing to do would be to avoid repeating lookups for the same element in list1 by saving in a dictionary the results for the elements that have already been seen. In this way you take care of duplicates in list1.

In [395]: def get_feature_list():
     ...:     d=dict.fromkeys(list2,1)
     ...:     feature_list2=[]
     ...:     for f in list1:
     ...:         if f not in d:
     ...:             d[f]=int(f in list2)
     ...:         feature_list2.append(d[f])
     ...:     return np.array(feature_list2)
     ...:

In [396]: %timeit get_feature_list()
100 loops, best of 3: 5.46 ms per loop

Using an extra dictionary to store results doesn't seem to bring any improvement though. This might be due to the extra overhead of creating a dictionary.

With a new function similar to the one above there is an improvement in running time but only for very large arrays

In [440]: def get_feature_list():
     ...:     d=[]
     ...:     feature_list2=[]
     ...:     for f in list1:
     ...:         if f not in list2:
     ...:             if f not in d:
     ...:                 d.append(f)
     ...:                 feature_list2.append(0)
     ...:             else:
     ...:                 feature_list2.append(1)
     ...:     return np.array(feature_list2)
     ...:

In [441]: list1=sorted(random.randint(0,1000) for r in range(400000))

In [443]: %timeit np.array([1 if (feat in list2) else 0 for feat in list1])
1 loop, best of 3: 273 ms per loop

In [442]: %timeit get_feature_list()
1 loop, best of 3: 1.6 s per loop

Another thing that could be done would be to remove an element from list2 after it's been found, so that the size list2 decreases at each step.

Right now I can't think of a way of exploiting the fact that list1 is sorted to improve efficiency.

user2314737
  • 27,088
  • 20
  • 102
  • 114