Count number of occurrences of a given list of strings in a data stream

Question

I have a stream of data something like,

stream = "carracecowtenhihellocohiwcar ......"

and i have to get no. of occurrence of all words in the list from the stream

words = ["car", "cow", "hi", ....]

So, result would be something like

result = {
  "car": 2,
  "cow": 1,
  "hi": 2,
  ....
  ....
}

with my current implementation, I am iterating through the list of words and add them to dict as below,

I am looking for some better way to do it, as list of words keep on increasing and data from stream in continuous.

This is what I have currently,

import re
def word_count(stream_obj):

    mydict = {}
    words = ["car", "cow", "hi", "hello"]
    max_word_len = len(max(words, key=len))
    regex = re.compile("|".join(words))
    last_chunk_remainder = ""

    while(stream_obj.getchunk() is not None):
        stream_data = last_chunk_remainder + stream_obj.getchunk()
        for word in words:
            mydict[word] = stream_data.count(word)

        # to handle the corner case like if the stream chunk ends with
        # “ca” and first letter of next is "r", so that make the word
        # match words in the list, which is "car"
        if not regex.findall(stream_data[-max_word_len:]):
            last_chunk_remainder = stream_data[-max_word_len:]

Thanks

Do you know about the method [`str.count`](https://docs.python.org/3/library/stdtypes.html#str.count)? If not, can you figure out the rest yourself? — abarnert, Oct 10 '14 at 22:19
See this: http://stackoverflow.com/questions/8899905/count-number-of-occurrences-of-a-given-substring-in-a-string — Joey Ezekiel, Oct 10 '14 at 22:19
@JoeyEzekiel: When you find an obvious dup, don't just post it as a comment; close, vote to close, or flag to close (depending on how much rep you have). — abarnert, Oct 10 '14 at 22:20
@abarnert: I don't think this guy will get a good answer, but this is not a dupe of that question. — Dair, Oct 10 '14 at 22:21
@user667648: The original version of the question was an obvious dup. Now that he's edited it so he's basically doing the same thing as the answer to that dup… I'm not sure what exactly his question is, but I'll reopen it. — abarnert, Oct 10 '14 at 22:28
So, what does "some better way to do it" mean? If your existing code is working, what's wrong with it, or needs to be improved about it? More readable? Faster? Shorter? Better error handling? — abarnert, Oct 10 '14 at 22:29
@abarnert by that I mean, faster/efficient/scalable, data is coming from stream and never ends, and list words to be filtered are also not fixed. — Adeel, Oct 10 '14 at 22:32
Do you mean to say that you have a continuous stream of letters, and you would like to recognize any letter sequences coming out of that stream which match any word in some (vast) corpus of words in (say) the English language, and keep a running count of how many times each word has appeared? — jacg, Oct 10 '14 at 22:51
OK, so what you really need is: something you call when a new letter comes in on the stream, and something else you call when a new word gets added to the list, each of which modify the same counter (and maybe also notify you if there were any changes among the top 100 or something like that)? If so, that's actually a more interesting question, but please edit the question to make it clearer. — abarnert, Oct 10 '14 at 23:02
Right, this is starting to sound much more interesting. My immediate thoughts tend towards tries or finite state machines. It looks non trivial, and something that I don't have time to explore in depth right now. I'm sure *somebody* out there knows the answer of the top of his head. — jacg, Oct 10 '14 at 23:07
I think I could explain a "dumb" solution to this, and then try to explain how a trie makes it simpler and more efficient… but not during a single compile cycle, so maybe I'll try later tonight. — abarnert, Oct 10 '14 at 23:34
@abarnert I have updated question and added implementation I have including the corner case that I am trying to cover, sorry I am still not clear about the question, please do let me know if you need more clarification. Thanks — Adeel, Oct 10 '14 at 23:36
@Aedil: The new version is full of syntax errors and broken indentation. Can you fix that? More importantly, it's still not clear to me whether you want to handle overlapping matches or just non-overlapping matches, or what the interface to this is supposed to be. — abarnert, Oct 10 '14 at 23:37
@abarnert fixed the syntax, when you say handle you mean match? if yes, then we have to match non overlapping matches. — Adeel, Oct 10 '14 at 23:55
@Aedil: Yes, I meant "match". I was trying to find a way to avoid saying "match overlapping matches" without repeating the word "match" too many match-times, but obviously that wasn't the right way. Meanwhile, I still need to know: what is the interface for this supposed to be? Obviously you're not going to call `word_count(stream)` every time you get a new character or a new word added, are you? So, what do you want to be able to call? — abarnert, Oct 11 '14 at 00:31
So basically stream is an class object subscribed to a port (actually kafka topic), and ```get_chunk()``` is a method that will always give you some data or empty string, incase of connection lost method returns None that will break while loop at the moment (but can be changed to anything). Variables we use to store the count will also be moved out to some sort of storage (redis, memcache). — Adeel, Oct 11 '14 at 08:50
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/62871/discussion-between-aedil-and-abarnert). — Adeel, Oct 11 '14 at 09:32

score 2 · Answer 1 · answered Oct 10 '14 at 22:18

2

stream = "carracecowtenhihellocohiwcar"
words = ["car", "cow", "hi"]
print { word:stream.count(word) for word in words }

answered Oct 10 '14 at 22:18

jacg

2,040
1
14
27

Did not down vote, but this isn't actually the right answer to this question. – Dair Oct 10 '14 at 22:22
@Smac89 Probably because it is a lazy question, and some believe that it should not be rewarded with an answer. – jacg Oct 10 '14 at 22:23
@user667648 AFAICT it gives an answer that conforms to his spec. How is it wrong? – jacg Oct 10 '14 at 22:26
@user4122880: The question is actually very involved, he needs to take stuff from like an actual (English) dictionary, and then using some sort of sequence detection algorithm to find how a word appears in the string. Saying that, your answer does give a start toward the right direction. – Dair Oct 10 '14 at 22:28
@user667648: I don't understand what you're seeing in the question, but if you think you can explain it better than the OP, please edit his question. Especially if there's a good question here whose answer would be non-trivial and interesting. – abarnert Oct 10 '14 at 22:30
@abarnert: Well, I am not really proposing to reopen it, since the question is vague (it seems to be more of a full on project), I just don't think it is a duplicate. – Dair Oct 10 '14 at 22:31
1

@user667648 Ah, sorry, I don't have the benefit of a telepathic connection to OP's brain, so I had to go on what OP wrote. And I stand by my solution as conforming to OP's *publicly* stated requirements :-) – jacg Oct 10 '14 at 22:33
@user4122880: Well, I didn't downvote, if you look at my page, you'll see I've casted a total of 0 downvotes because I don't really believe in giving them. – Dair Oct 10 '14 at 22:34
@user667648 I'm not suggesting that you downvoted. I'm merely trying to understand your claim that it is not a correct answer to the question. – jacg Oct 10 '14 at 22:37
@user4122880: Upon looking at the original question again, you're right, lol. Have not had coffee. Too tired. – Dair Oct 10 '14 at 22:39
@user4122880: I was thinking along the lines of the fact his list had an ellipsis and yours didn't. I didn't realize he actually was using a list of words, which would eat RAM up. Thought that might of been part of the problem he is facing, and it still might be, since he need things to be more "efficient". – Dair Oct 10 '14 at 22:41
@user4122880 sorry buddy, I think I just post the question in hurry, but yeah you are right, that is exactly what I am doing currently, what I want to find out is, if there is any better way to do this. – Adeel Oct 10 '14 at 22:41
1

@user667648 No worries. Enjoy the coffee. – jacg Oct 10 '14 at 22:41

jacg · Accepted Answer · 2014-10-11T15:16:03.500

So I played around a bit with a trie-based approach to your problem (now that I understand what you want). Maybe you can find something useful in it. There is an initial idea, an abstract interface slapped around that idea, to help looking for more efficient solutions, and a few pytest tests to help understand whether/how it works. There are trie modules out there, but this seemed like more fun, for now.

from collections import defaultdict

# Faking an infinite stream of characters
from itertools import cycle
stream = cycle('carracecowtenhihellocohiwcar')

# Just exploring the idea of a trie. If it works, we can think about a
# more efficient implementation later.
def new_trie_branch():
    return defaultdict(new_trie_branch)

# A symbol used to indicate leaves in the trie
END_OF_WORD = object()

# The trie is implemented as a dictionary mapping letters to
# sub-tries. The pseudo-letter END_OF_WORD marks the end of a path in
# the trie which corresponds to a valid whole word.
def make_trie(words):
    trie = new_trie_branch()
    for word in words:
        branch = trie
        for letter in word:
            branch = branch[letter]
        branch[END_OF_WORD] = True
    return trie

# As each letter comes out of the stream, it is fed into a collection
# of 'listeners'. Each listener is a stateful function which
# corresponds to some location in the trie and is aware of the word
# prefix which describes the path from the trie's root to the current
# node. When such a listener is given a letter, it checks (in the trie)
# whether the prefix plus the new letter form a complete word: if so,
# it bumps the word count for that word. It also checks whether the
# prefix plus the new letter form a valid longer prefix: if so, it
# adds a new listener (corresponding to the next node in the trie)
# into the collection of listeners that will be applied to the next letter to
# come out of the stream.
def count_words_in_stream(words, stream, word_count = None):
    word_count = defaultdict(int) if word_count is None else word_count

    def make_listener(branch, prefix):
        def listener(next_letter):
            if next_letter in branch:
                next_branch = branch[next_letter]
                word = prefix + next_letter
                if END_OF_WORD in next_branch:
                    word_count[word] += 1
                next_listeners.append(make_listener(next_branch, word))
        return listener

    start_of_word_listener = make_listener(make_trie(words), '')
    listeners = [start_of_word_listener]
    for letter in stream:
        next_listeners = [start_of_word_listener]
        for listen in listeners:
            listen(letter)
        listeners = next_listeners
    return word_count

# Now we try to come up with an implementation-independent interface
# for the trie to allow us to refactor more easily in search of a more
# efficient implementation, if necessary.
class Trie(object):

    def __init__(self, words):
        self._trie = make_trie(words)

    # Checks whether the given WORD is present in the trie
    def __contains__(self, word):
        trie = self._trie
        for letter in word:
            if letter not in trie:
                return False
            trie = trie[letter]
        else:
            return END_OF_WORD in trie

    # The 'in' operator (__contains__) checks for the presence of a
    # whole word in the trie, so we need a different interface for
    # checking whether a given branch exists at this node.
    def has_branch(self, branch_id):
        return branch_id in self._trie

    # Picks one branch of the trie
    def __getitem__(self, branch_id):
        branch = Trie('')
        branch._trie = self._trie[branch_id]
        return branch

    # Iterates over the branches of this trie
    def __iter__(self):
        return iter(self._trie)

# Same as count_words_in_stream above, but uses the abstract interface
# we just invented.
def abstract_count_words_in_stream(words, stream, word_count = None):
    word_count = defaultdict(int) if word_count is None else word_count

    def make_listener(branch, prefix):
        def listener(next_letter):
            if branch.has_branch(next_letter):
                next_branch = branch[next_letter]
                word = prefix + next_letter
                if next_branch.has_branch(END_OF_WORD):
                    word_count[word] += 1
                next_listeners.append(make_listener(next_branch, word))
        return listener

    start_of_word_listener = make_listener(Trie(words), '')
    listeners = [start_of_word_listener]
    for letter in stream:
        next_listeners = [start_of_word_listener]
        for listen in listeners:
            listen(letter)
        listeners = next_listeners
    return word_count

################################################################################
# Some tests of the implementation. These are written in the pytest
# framework.
################################################################################
from pytest import mark

# Testing the specific implementation details. Just to get us going.
@mark.parametrize('words, trie', (
    (['one'],
     {'o': {'n': {'e': {END_OF_WORD: True}}}}),
    ('one two'.split(),
     {'o': {'n': {'e': {END_OF_WORD: True}}},
      't': {'w': {'o': {END_OF_WORD: True}}}}),
    ('abc abd'.split(),
     {'a': {'b': {'c': {END_OF_WORD: True},
                  'd': {END_OF_WORD: True}}}})
))
def test_make_trie(words, trie):
    assert make_trie(words) == trie

count_words_test_data = ('words, stream, expected', (
    (['cow'] ,'abcdefg', {}),
    (['cow'] ,'cowcowcow', {'cow':3}),
    ('cow car fish'.split(), 'cowcarfishcarcarfishcow',
     {'cow':2, 'car':3, 'fish':2}),
    ('and hand handy'.split(), 'handyandhand',
     {'and':3, 'hand':2, 'handy':1}),
))

@mark.parametrize(*count_words_test_data)
def test_count_words_in_stream(words, stream, expected):
    assert count_words_in_stream(words, stream) == expected


################################################################################
# Testing the abstract Trie interface. This will help if we want to
# refactor the implementation in search of something more efficient.
################################################################################
@mark.parametrize('words, absents', (
    ('one'.split(), 'o on ono'.split()),
    ('o on one'.split(), []),
    ('abc abd'.split(), ['ab'])
))
def test_Trie_word_presence(words, absents):
    trie = Trie(words)
    for word in words:
        assert word in trie
    for absent in absents:
        assert absent not in trie

@mark.parametrize(*count_words_test_data)
def test_abstract_count_words_in_stream(words, stream, expected):
    assert abstract_count_words_in_stream(words, stream) == expected

score 0 · Answer 3 · answered Oct 11 '14 at 21:11

I got it working and have tried to cover all know corner cases, will be really thankful if you can propose some suggestions/improvements, Thanks for help, and sorry for initial incomplete question.

import re
from collections import defaultdict

WORD_COUNTS = defaultdict(int)
WORDS = ["car", "cat", "cow", "hi", "hello"]
MAX_WORD_LEN = len(max(WORDS, key=len))
REGEX = ("|".join(WORDS))
RE_OBJ = re.compile(REGEX)

def count_words(stream):
    last_stream_remainder = ""

    while True:
        data = stream.get_chunk()

        # Breaking point 
        if data is None:
            break

        if not data:
            continue

        data = last_stream_remainder + data
        for match in RE_OBJ.finditer(data):
            WORD_COUNTS[match.group(0)] += 1

        # to cover the corner case like remainder from last 
        # chunk can attach with new one and make a word
        if match:
            if match.end() >= len(data):
                continue
            else:
                last_match = min((len(data) - match.end()), MAX_WORD_LEN)
                last_stream_remainder = data[-last_match:]
        else:
            last_stream_remainder = data[-MAX_WORD_LEN:]

class StreamReader(object):
    STREAM_DATA = ["car1cat1lftrysomecow1shi1iamgoinghello1pleasegoocar2sarehere",
                   "car3car4car5cat2cat3h", "i2thisishello2hello3he", "", "llo4", None]

    def get_chunk(self):
        return self.STREAM_DATA.pop(0)

stream = StreamReader()
count_words(stream)

print WORD_COUNTS.items()
# [('car', 5), ('hi', 3), ('hello', 4), ('cow', 1), ('cat', 3)]

score 0 · Answer 4 · answered Jul 18 '16 at 03:58

Here's my take on it. Takes O(k) time per character, or O(nk) for the whole stream, where k is the length of the word and n is length of the stream; and O(k) space.

class Solution:
  def __init__(self, s):
    self.buff, self.count, self.s = '', 0, s
  def process(self, a):
    self.buff += a
    if len(self.buff) > len(self.s):
      self.buff = self.buff[1:]
      if (self.buff) == self.s:
        self.count += 1

And here are some tests:

solution = Solution('cocoa')
solution.process('c')
solution.process('o')
solution.process('c')
solution.process('o')
assert solution.count == 0
solution.process('c')
solution.process('o')
solution.process('a')
assert solution.count == 1
print('First test passed')

solution.count = 0
solution = Solution('acbcc')
stream = 'acbcbcc'
for a in stream:
  solution.process(a)
assert solution.count == 0
print('Second test passed')

score 0 · Answer 5 · answered Jun 02 '19 at 19:09

I tried the below code and it worked well for me. Used trie trees for solving this problem.

from collections import defaultdict
from itertools import cycle

def new_trie_branch():
    return defaultdict(new_trie_branch)

END_OF_WORD = object()


def make_trie_tree(words):
    trie = new_trie_branch()
    for word in words:
        branch = trie
        for letter in word:
            branch = branch[letter]
        branch[END_OF_WORD] = True
    return trie


def count_words_in_stream(words, stream, word_count = None):
    word_count = defaultdict(int) if word_count is None else word_count

    def make_listener(branch, prefix):
        def listener(next_letter):
            if next_letter in branch:
                next_branch = branch[next_letter]
                word = prefix + next_letter
                if END_OF_WORD in next_branch:
                    word_count[word] += 1
                next_listeners.append(make_listener(next_branch, word))
        return listener

    start_of_word_listener = make_listener(make_trie_tree(words), '')
    listeners = [start_of_word_listener]
    for letter in stream:
        next_listeners = [start_of_word_listener]
        for listen in listeners:
            listen(letter)
        listeners = next_listeners
    return word_count


stream = "acacathellockword"
words = ['aca','cat','hell','hello','lock','world']
print(dict(count_words_in_stream(words,stream)))

Output :

    {'aca': 2, 'cat': 1, 'hell': 1, 'hello': 1, 'lock': 1}

Count number of occurrences of a given list of strings in a data stream

5 Answers5