0

Here is what I have:

long_list = a very long list of integer values (6M+ entries)

wanted_list = a list of integer values that are of interest (70K entries)


What I need:

mask_list = a list of booleans of the same length as long_list, describing whether each element of long_list in present in the wanted_list (i.e. [is long_list[0] in wanted_list?, is long_list[1] in wanted_list?,....]). The number of'True' entries in this list should be the same as len(wanted_list)


I got a working code using a for loop, but as expected it's way too slow for the length of the lists that I am working with (takes several minutes to run):

masklist = []

for element in long_list:
    if element in wanted_list:
        masklist.append(True)
    else: 
        masklist.append(False)

I was wondering if there is a more elegant and fast way to achieve this goal? I was looking into numpy.ma module, but could not think of an elegant way to apply it to this problem

  • Does this answer your question? [Numpy mask based on if a value is in some other list](https://stackoverflow.com/questions/13629061/numpy-mask-based-on-if-a-value-is-in-some-other-list) – August Sep 30 '22 at 16:43
  • You are using the wrong data type for this type of operation. Both of your collections of numbers should be held in `sets`. Is index position of the match important in `long_list` or just the *values*? – AirSquid Sep 30 '22 at 16:44

2 Answers2

1

You can use numpy.isin for this:

masklist = np.isin(long_list, wanted_list)
Jab
  • 26,853
  • 21
  • 75
  • 114
0

This is slow because the algorithm has a poor complexity (O(n²)). This can be done in O(n):

wanted_set = set(wanted_list)
masklist = [element in wanted_set for element in long_list]

If you operate on Numpy array, then you can use np.unique and np.searchsorted but it will be computed in O(n log n) time. It may be faster for small arrays (due to the Numpy implementation), but not for huge ones (due to the complexity and inefficient cache accesses). np.isin may also be useful and maybe faster.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • That makes sense, thank you! The code you suggested is definitely much more sustainable, and np.isin was the fastest option. – jingajinga Sep 30 '22 at 17:50