Optimize the search in a long list of strings

Question

I have the following MWE, where I use a list comprehension to search through a list ls for strings that are contained strings:

import numpy as np

strings = ["ASD", "DSA", "ABC", "ABQ"]
ls     = np.asarray(["ASD", "DSA", "ASD", "ABC", "ABQ","ASD", "DSA", "ASD", "ABC", "ABQ","ASD", "DSA", "ASD", "ABC", "ABQ"])

for string in strings:
    print(len(ls[[string in s for s in ls]]))

This works as intended - however, the issue is that my ls-list is very long (10^9 entries), and the list comprehension takes a considerable time.

Is there a way to optimize the above code?

EDIT: I'm looking for a solution that allows me to record the individual occurences, i.e. 6, 3, 3 and 3

How long are each of the strings? If they are reasonably small you could use a **trie**. — meowgoesthedog, Dec 21 '18 at 09:36
Do you have to use `numpy`? Or can you accept proposals without it? — Ralf, Dec 21 '18 at 09:38

yatu · Answer 1 · 2018-12-21T09:53:59.183

4

Use np.unique with return_counts=True and use np.in1d to perform boolean indexation and keep only the values in ls that are present in strings on both the unique values and the counts:

l, counts = np.unique(ls, return_counts=True)
mask = np.in1d(l,strings)

l[mask]
#array(['ABC', 'ABQ', 'ASD', 'DSA'], dtype='<U3')

counts[mask]
array([3, 3, 6, 3], dtype=int64)

edited Dec 21 '18 at 09:53

answered Dec 21 '18 at 09:35

yatu

86,083
12
84
139

Thanks - however, this gives me the total sum. I'd like to get the sum for each individual string in `strings` – N08 Dec 21 '18 at 09:43
I see, that's why it always helps to add a sample output – yatu Dec 21 '18 at 09:43
Edited, should do what you want now – yatu Dec 21 '18 at 09:50
Just a headsup: if @N08 needs the result to be in the same order as the original `strings` list, then additional processing seems to be requiered, because `l` and `counts` will not necesarely be in the same order as `strings`. – Ralf Dec 21 '18 at 10:34

Ralf · Accepted Answer · 2018-12-21T10:31:01.337

I suggest you use the ideas proposed in this post; the best approach could be using collections.Counter.

This builds the Counter once, and then you can easily look up the individual elements you want the count for.

This could look like this:

import collections
import numpy as np
import timeit

def _get_data(as_numpy):
    data = []
    for _ in range(10**6):
        data.extend(["ASD", "DSA", "ASD", "ABC", "ABQ"])

    if as_numpy:
        data = np.asarray(data)

    return data

def f1(data):
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    result_list = []

    for search_str in search_list:
        result_list.append(
            len(data[[search_str in s for s in data]]))

    return result_list

def f2(data):
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    result_list = []

    c = collections.Counter(data)
    for search_str in search_list:
        result_list.append(
            c[search_str])

    return result_list

def f3(data):
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    result_list = []

    c = collections.Counter(data)
    for search_str in search_list:
        result_list.append(
            data.count(search_str))

    return result_list

def f4(data):
    # suggestion by user 'nixon' in another answer to this question
    search_list = ["ASD", "DSA", "ABC", "ABQ"]

    l, counts = np.unique(data, return_counts=True)
    # 'l' and 'counts' are in different order than 'search_list'
    result_list = [
        counts[np.where(l == search_str)[0][0]]
        for search_str in search_list]

    return result_list

To assure that these approaches get the same results:

data1 = _get_data(as_numpy=True)
data2 = _get_data(as_numpy=False)
assert f1(data1) == f2(data2) == f3(data2) == f4(data1)

And comparing the timings, I get:

print(timeit.timeit(
    'f(data)',
    'from __main__ import f1 as f, _get_data; data = _get_data(as_numpy=True)',
    number=10))
print(timeit.timeit(
    'f(data)',
    'from __main__ import f2 as f, _get_data; data = _get_data(as_numpy=False)',
    number=10))
print(timeit.timeit(
    'f(data)',
    'from __main__ import f3 as f, _get_data; data = _get_data(as_numpy=False)',
    number=10))
print(timeit.timeit(
    'f(data)',
    'from __main__ import f4 as f, _get_data; data = _get_data(as_numpy=True)',
    number=10))

# f1 48.2 sec
# f2  1.7 sec
# f3  3.8 sec
# f4  9.7 sec

As you can see, there is an order of magnitud in time difference.

Does that work for your case?

EDIT: added approach using numpy.unique, similar to the one suggested by @nixon in another answer to this question; it still seems to be slower than using collections.Counter.

@N08 my tests seem to indicate that `numpy` is slower than `Counter` for this task. — Ralf, Dec 21 '18 at 10:32

Optimize the search in a long list of strings

2 Answers2