5

What's the best way to find the count of occurrences of strings from a list in a target string? Specifically, I have a list :

string_list = [
    "foo",
    "bar",
    "baz"
]

target_string = "foo bar baz bar"

# Trying to write this function!
count = occurrence_counter(target_string) # should return 4

I'd like to optimize to minimize speed and memory usage, if that makes a difference. In terms of size, I would expect that string_list may end up containing several hundred substrings.

Kevin Bedell
  • 13,254
  • 10
  • 78
  • 114
  • Possible duplicate of [python: get number of items from list(sequence) with certain condition](http://stackoverflow.com/questions/15375093/python-get-number-of-items-from-listsequence-with-certain-condition) – shapiro yaacov May 10 '17 at 13:56
  • 2
    Use suffix tries. Create a trie with your patterns and then iterate over the text while exploring the tree for occurences. Imagine that one of your patterns is `aaa` and your text is `aaaaa`. Solution should be 3 which you can achieve using suffix tries. –  May 10 '17 at 14:00
  • 3
    Are you strings always space separated words? – Chris_Rands May 10 '17 at 14:03
  • 4
    Also, what should be the result for `["foo", "bar", "foobar"]` and `"foobar"`? – Eric Duminil May 10 '17 at 14:07
  • Thanks for the good questions - my use case today is a list of space-delimited words where the entire word needs to be matched (i.e., no 'foobar'-like issues) – Kevin Bedell May 10 '17 at 15:44
  • Also, I need to match exact words -- that is, match 'foo', but not 'food' – Kevin Bedell May 10 '17 at 15:54

8 Answers8

5

Another way using collelctions.Counter:

from collections import Counter
word_counts = Counter(target_string.split(' '))
total = sum(word_counts.get(w, 0)) for w in string_list)
Aamir Rind
  • 38,793
  • 23
  • 126
  • 164
  • 2
    Does not answer the question, though! It says nowhere to only consider complete space-separated tokens. Occurrences in a string seems a more complex problem. – user2390182 May 10 '17 at 14:02
  • @schwobaseggl The OP also didn't say anything about complexity :) It answers what was asked in the question. – Aamir Rind May 10 '17 at 14:05
3

This works!

def occurrence_counter(target_string):
    return sum(map(lambda x: x in string_list, target_string.split(' ')))

The string gets split into tokens, then each token gets transformed into a 1 if it is in the list, a 0 otherwise. The sum function, at last, sums those values.

EDIT: also:

def occurrence_counter(target_string):
    return len(list(filter(lambda x: x in string_list, target_string.split(' '))))
gioaudino
  • 573
  • 8
  • 23
  • In this case `1 if x if string_list else 0` is the same as `x in string_list`. But anyway, that does not count number of occurrences. – bereal May 10 '17 at 13:59
  • You're right, edited. Why should it not count the occurrences? – gioaudino May 10 '17 at 14:03
  • If a string appears in the target list few times, the OP wants it to be counted each time (so the expected result in the example is 4, though the `string_list` contains only 3 elements). – bereal May 10 '17 at 14:05
  • It does return 4 :) each token will be 1 if it is in the `string_list`. In the example it would be `[1, 1, 1, 1]` so sum is 4 – gioaudino May 10 '17 at 14:07
2

This Python3 should work:

In [4]: string_list = [
   ...:     "foo",
   ...:     "bar",
   ...:     "baz"
   ...: ]
   ...: 
   ...: set_of_counted_word = set(string_list)
   ...: 
   ...: def occurrence_counter(target_str, words_to_count=set_of_counted_word):
   ...:     return sum(1 for word in target_str.strip().split()
   ...:                if word in words_to_count)
   ...: 
   ...: 
   ...: for target_string in ("foo bar baz bar", " bip foo bap foo dib baz   "):
   ...:     print("Input: %r -> Count: %i" % (target_string, occurrence_counter(target_string)))
   ...: 
   ...: 
Input: 'foo bar baz bar' -> Count: 4
Input: ' bip foo bap foo dib baz   ' -> Count: 3

In [5]:
Paddy3118
  • 4,704
  • 27
  • 38
1

You could use a variable to store a running count is you iterate through the list like so:

def occurence_counter(x):
    count = 0
    for y in x:
        count +=1
    return count
DataNinja
  • 67
  • 1
  • 9
1

Another solution:

def occurrence_counter(target_string, string_list):
    target_list = target_string.split(' ')
    return len([w for w in target_list if w in string_list])
Carlos Mayo
  • 2,054
  • 1
  • 19
  • 35
1

Combo of sum and string.count:

def counter(s, lst)
    return sum(s.count(sub) for sub in lst)

This will not count overlapping occurrences of the same pattern.

user2390182
  • 72,016
  • 6
  • 67
  • 89
  • @SembeiNorimaki: It's not yet clear what the desired result is with overlapping patterns. – Eric Duminil May 10 '17 at 14:09
  • @SembeiNorimaki True. That feels like an edge-case, though. It is however pythonic, built-in, concise, and probably more performant than the other suggestions that suffer from the same problem to an even higher degree ;) – user2390182 May 10 '17 at 14:09
  • Why should it be more performant than other solutions? There's no optimization, so it's `O(m*n)` with `n` characters and `m` elements in lst. If the elements are separated by spaces, the `Counter` solution is much faster. If not, a Trie solution would be much faster. – Eric Duminil May 10 '17 at 14:14
  • @EricDuminil True, however, a trie solution is not suggested (as an answer) and the token `Counter` does not take nearly as many substrings into consideration. In fact, there are no solutions here that do a true substring occurrrences count. – user2390182 May 10 '17 at 14:19
  • For my use case, overlapping patterns are currently not an issue. Also, I didn't state this (sorry!) but i need to match exact word boundaries (i.e., match 'foo', but not 'food') – Kevin Bedell May 10 '17 at 15:53
1

You could use a Trie to convert your substrings to a regex pattern (e.g. (?:ba[rz]|foo)) and parse your target_string:

import re
from trie import Trie

trie = Trie()

substrings = [
    "foo",
    "bar",
    "baz"
]
for substring in substrings:
    trie.add(substring)
print(trie.pattern())
# (?:ba[rz]|foo)

target_string = "foo bar baz bar"
print(len(re.findall(trie.pattern(), target_string)))
# 4

The required library is here : trie.py

It should be much faster than parsing the whole target_string for each substring, but it might not return the desired result for overlapping substrings. It returns 2 for ["foo", "bar", "foobar"] and "foobar".

A related question was : "Speed up millions of regex replacements in Python 3" : here's an answer with sets and one with a trie regex.

Community
  • 1
  • 1
Eric Duminil
  • 52,989
  • 9
  • 71
  • 124
  • Would this make sense if `substrings` contained 2-300 words? – Kevin Bedell May 10 '17 at 15:51
  • @KevinBedell: With 2 words, surely not. With many dozens or a few hundreds, yes. The linked answers had many thousands, and the trie regex was 1000 faster than the regex union. – Eric Duminil May 10 '17 at 18:54
  • @KevinBedell: The accepted answer isn't efficient, and only works with complete words, separated by spaces. – Eric Duminil May 10 '17 at 18:57
  • My current need is only to work with complete words that are separated by spaces. It has the advantage of not requiring anything but builtin functions, is concise and took almost no time to adapt to my needs. After I finished coding my work, I chose the answer that was most useful in solving my problem. I upvoted this answer as I felt it was a solid answer as well. – Kevin Bedell May 10 '17 at 19:30
  • 1
    @KevinBedell: Sorry, I didn't want to sound rude. I understand your needs better now. In this case, you might want to use the `Counter` solution, which should be the fastest and cleanest one. – Eric Duminil May 10 '17 at 19:44
0

I am not sure this is the most pythonic way, but you can try it:

string_list_B = target_string.split(" ")
commonalities = set(string_list) - (set(string_list) - set(string_list_B))
stefan.stt
  • 2,357
  • 5
  • 24
  • 47