
I am trying to make a co-occurrence count work fast in Python.

I have a list of context windows. Each context window is a list of words connected by dependency labels, obtained after parsing the corpus with a transition parser. A context window has an average length of 3, so it is not simply the 10 words before and after the focus word. That is why I need a list of lists and cannot just use a flat list of words.

I also have a dictionary in which every distinct word of the corpus has a distinct index as its value, but I don't think using it speeds things up much. Right now I am only working with a small test corpus of 50,000 words and about 300,000 context windows. My code is fast enough for that, but it needs to be a lot faster because it has to work on a much larger corpus in the end.

import time
from collections import Counter

def current_milli_time():
    # timing helper used below
    return round(time.time() * 1000)

class CoOc:
    def __init__(self):
        self.co_occurrence = Counter()

    def add_count(self, token_index, context, token_to_index):
        try:
            # some words are not in the token dictionary if they don't appear frequently enough
            context_index = token_to_index[context]
        except KeyError:
            context_index = -1
        if context_index != -1:
            self.co_occurrence[(token_index, context_index)] += 1

    def count_co_occurrence(self, context_chains, token_index_dic):
        token_index_list = []
        for key in token_index_dic:
            token_index_list.append([key, token_index_dic[key]])
        time1 = current_milli_time()
        # only the first 100 tokens, to check how long it takes
        for token_key in token_index_list[0:100]:
            token = token_key[0]
            token_index = token_key[1]
            for context_chain in context_chains:
                if token in context_chain:
                    for context_word in context_chain:
                        if token != context_word:
                            self.add_count(token_index, context_word, token_index_dic)
        time2 = current_milli_time()
        print(time2 - time1)

So in the count function I turn the token_index dictionary, which stores all the tokens, into a list and take only a small slice of it to check how long it takes.

I also tried looping through the list of all context windows first, to create a list of windows consisting only of indices before looping through the tokens. That way I would not even need the add_count function and could replace the call to add_count with its last line, but it did not speed up the process.
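
That preprocessing looked roughly like this (a sketch, not the exact code):

# map every context window to vocabulary indices up front,
# dropping the words that are not in the dictionary
indexed_chains = [
    [token_index_dic[word] for word in chain if word in token_index_dic]
    for chain in context_chains
]
# the innermost loop can then do
#     self.co_occurrence[(token_index, context_index)] += 1
# directly, without calling add_count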

Like this it takes 3 seconds to run, for only 100 focus words. I need it to be a lot faster to work on the corpus I actually want to use it on. Is that even possible in Python, or do I need to implement it in another language?

So the first 10 context windows look like this:

['Gerücht', 'drücken', 'Kurs']
['Konkursgerüchte', 'drücken', 'Kurs']
['Kurs', 'Aktie']
['Kurs', 'Amazon-Aktie']
['der', 'Amazon-Aktie']
['der', 'Aktie']
['Begleitet', 'Gerücht', 'Konkurs']
['Begleitet', 'Marktgerüchten', 'Konkurs']
['begleiten', 'Gerücht', 'Konkurs']
['begleiten', 'setzt fort', 'Aktie', 'Talfahrt']

Each context word appears in token_index_dic with some index, unless it does not appear frequently enough in the corpus.

The first 10 elements of the token_index_list look like this:

['Browser-gestützte', 0]
['Akzeptanzstellen', 1]
['Nachschubplanung', 2]
['persönlichen', 3]
['Unionsfraktion', 4]
['Wired', 21122]
['Hauptfigur', 6]
['Strafgesetz', 7]
['Computer-Hersteller', 8]
['Rückschläge', 9]

and then when I print self.co_occurrence it looks like this:

# (focus_word_index, context_word_index): count
Counter({(1, 9479): 11, (1, 20316): 7, (2, 1722): 7, (2, 20217): 7, (2, 19842): 7, (2, 2934): 7, (3, 11959): 7, (3, 2404): 7, (3, 1105): 7, (4, 12047): 7, (4, 19262): 7, (0, 13585): 4, (1, 18525): 4, (1, 1538): 4, (1, 9230): 4, (1, 20606): 4, (1, 1486): 4, (2, 13166): 4, (2, 6948): 4, (2, 23028): 4, (2, 14051): 4, (3, 3854): 4, (3, 7908): 4, (3, 4902): 4, (3, 13222): 4, (4, 23737): 4, (4, 6785): 4, (4, 18718): 4, (5, 15424): 4, (5, 4394): 4, (5, 21534): 4, (5, 5829): 4, (5, 6513): 4, (6, 23331): 4, (6, 7234): 4, (6, 20606): 4, (6, 22951): 4, (6, 7318): 4, (6, 15332): 4, (9, 21183): 4, (9, 23028): 4, (9, 1572): 4, (1, 25031): 3, (1, 5829): 3, (1, 14458): 3, (3, 14387): 3, (3, 9574): 3, (3, 21061): 3, (4, 18143): 3, (4, 3306): 3, (4, 17798): 3, (4, 2250): 3, (5, 9982): 3, (5, 5999): 3, (6, 15727): 3, (0, 16008): 2, (0, 1304): 2, (0, 5210): 2, (0, 17798): 2, (1, 20000): 2, (1, 19326): 2, (1, 3820): 2, (1, 25173): 2, (1, 21843): 2, (2, 20662): 2, (3, 7521): 2, (3, 14479): 2, (3, 8109): 2, (3, 18311): 2, (4, 2556): 2, (5, 23940): 2, (5, 1823): 2, (5, 18733): 2, (6, 3131): 2, (6, 947): 2, (6, 18540): 2, (6, 4756): 2, (6, 2743): 2, (6, 7692): 2, (6, 20263): 2, (6, 8670): 2, (6, 2674): 2, (6, 20050): 2, (6, 13274): 2, (6, 17876): 2, (6, 7538): 2, (6, 11098): 2, (6, 4296): 2, (6, 2417): 2, (6, 21421): 2, (6, 19256): 2, (6, 16739): 2, (7, 10908): 2, (7, 4439): 2, (7, 9492): 2, (8, 7027): 2, (8, 599): 2, (8, 4439): 2, (9, 16030): 2, (9, 6856): 2, (9, 24320): 2, (9, 15978): 2, (1, 6454): 1, (1, 14482): 1, (1, 2643): 1, (1, 7338): 1, (2, 21061): 1, (2, 1486): 1, (4, 4296): 1, (4, 23940): 1, (4, 5775): 1, (5, 24133): 1, (5, 2743): 1, (5, 11472): 1, (5, 19336): 1, (5, 20606): 1, (5, 2740): 1, (5, 9479): 1, (5, 14200): 1, (6, 22175): 1, (6, 13104): 1, (6, 10435): 1, (6, 1891): 1, (6, 22353): 1, (6, 4852): 1, (6, 20943): 1, (6, 23965): 1, (6, 13494): 1, (7, 1300): 1, (7, 12497): 1, (7, 2788): 1, (8, 13879): 1, (8, 2404): 1})
bievjucs
  • Can you provide a sample input and corresponding output please? It helps a lot for understanding the code and then checking that it stays correct through performance optimization passes. Also, how fast would you like it to be? (your target input being 50K words and 300K context windows) – Lenormju Feb 11 '22 at 11:11
  • I mean for now my code runs through the 50k words and the 300k context windows in roughly half an hour. But I expect the final corpus will consist of 500k words and probably at least 100 times more context windows, which would add up to 1000 times half an hour, which is obviously too long. I don't mind it running for a day or two. – bievjucs Feb 11 '22 at 11:48
  • Ok, there is probably a bug, I will fix it and come back to you, maybe it solves the problem already. Ok, fixed the bug, the problem remains. Still too slow. I will add the output and the samples now. – bievjucs Feb 11 '22 at 12:14

2 Answers


First let's define a decorator so that we can time the whole function:

import time

def measured_duration(func):  # decorator to return the call duration and the result
    def wrapper(*args, **kwargs):
        time_before = time.time()
        result = func(*args, **kwargs)
        time_after = time.time()
        return (time_after - time_before) * 1000, result
    return wrapper

which can be used like this:

import os
duration, result = measured_duration(os.listdir)(".")
print(duration, result)
# 1.5974044799804688e-05 ['venv', 'test.csv', '__pycache__', ...]

Then let's define some types (it does not speed things up, but it helped me understand the problem):

from collections import Counter
from typing import Counter as CounterType, Dict, Sequence, Tuple
# typing.Counter is used here because collections.Counter is not subscriptable before Python 3.9,
# and a Counter is generic in its key type only


ContextWindow = Tuple[str, ...]
TokensIndexes = Dict[str, int]
Cooccurrences = CounterType[Tuple[int, int]]
#                       token index ^    ^ index in context window

def count_cooccurrences(windows: Sequence[ContextWindow], index_by_token: TokensIndexes) -> Cooccurrences:
    ...

(yes, I could have used NewType, but did not want to)

So we can use it like this:

def main():
    # the example data
    windows = [
        ['Gerücht', 'drücken', 'Kurs'],
        ['Konkursgerüchte', 'drücken', 'Kurs'],
        ['Kurs', 'Aktie'],
        ['Kurs', 'Amazon-Aktie'],
        ['der', 'Amazon-Aktie'],
        ['der', 'Aktie', 'Begleitet'],  # <------ changed
        ['Begleitet', 'Gerücht', 'Konkurs'],
        ['Begleitet', 'Marktgerüchten', 'Konkurs'],
        ['begleiten', 'Gerücht', 'Konkurs'],
        ['begleiten', 'setzt fort', 'Aktie', 'Talfahrt'],
    ]
    token_index_list = {
        'Begleitet': 0,  # <------- modified
        # 'Browser-gestützte': 0,  # -----
        'Akzeptanzstellen': 1,
        'Nachschubplanung': 2,
        'persönlichen': 3,
        'Unionsfraktion': 4,
        'Wired': 21122,
        'Hauptfigur': 6,
        'Strafgesetz': 7,
        'Computer-Hersteller': 8,
        'Rückschläge': 9,
    }

    duration, cooccurrences = measured_duration(count_cooccurrences)(windows, token_index_list)
    print(duration, cooccurrences)

main()

I changed the data you provided a bit so that there are some matches.

Now for the function:

import itertools


def count_cooccurrences(windows: Sequence[ContextWindow], index_by_token: TokensIndexes) -> Cooccurrences:
    return Counter(
        (token_index, context_index)
        for (token, token_index), window_tokens in itertools.product(
            index_by_token.items(), windows
        )
        for context_index in (
            index_in_window
            for index_in_window, window_token in enumerate(window_tokens)
            if token == window_token
        )
        if context_index != -1
    )
5.269050598144531e-05 Counter({(0, 0): 2, (0, 2): 1})
# on my computer, I always get ~0.055 ms

I made it directly return a Counter object, which gets fed an iterable at initialisation containing pairs of (token_index_in_corpus, token_index_in_context_window). Sadly, we only want to count the matches and there is no find method on tuple/list (it exists for str only), so we have a whole for context_index loop that is meant to iterate usually 0 times, sometimes 1 time. But it supports the token appearing several times in the context window, if that were ever the case.
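
For readability, the nested generator expression is equivalent to this explicit loop (a sketch for illustration only, not the code that was measured):

import itertools
from collections import Counter

def count_cooccurrences_loop(windows, index_by_token):
    counts = Counter()
    for (token, token_index), window_tokens in itertools.product(index_by_token.items(), windows):
        for index_in_window, window_token in enumerate(window_tokens):
            if token == window_token:
                # the pair is (index of the token in the vocabulary, its position in the window)
                counts[(token_index, index_in_window)] += 1
    return counts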

Please let me know if this runs fast enough for you. The next step would be to make it parallel using several processes (since it is a CPU-bound task).

EDIT

For multiprocessing your data, first let's define a helper function:

from itertools import zip_longest
################################################################################
# copied from https://docs.python.org/3/library/itertools.html#itertools-recipes
def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'ignore':
        return zip(*args)
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)
    else:
        raise ValueError('Expected fill, strict, or ignore')
################################################################################

Then the parallel implementation:

import concurrent.futures
import itertools
from functools import reduce


WINDOW_CHUNK_SIZE = 5
CORPUS_TOKENS_CHUNK_SIZE = 5
# means that each task will consist of searching for CORPUS_TOKENS_CHUNK_SIZE tokens in WINDOW_CHUNK_SIZE windows
# you can fiddle with these to find an optimum, here I used 5 and 5 for demo purposes


def unpack_count_cooccurrences(args):
    # cf https://stackoverflow.com/q/60002957/11384184
    # each submitted task gets its arguments bundled in a single tuple, so we unpack it here
    return count_cooccurrences(args[0], args[1])


def count_cooccurrences_parallel(windows: Sequence[ContextWindow], index_by_token: TokensIndexes) -> Cooccurrences:
    with concurrent.futures.ProcessPoolExecutor(max_workers=None) as executor:
        all_params = (  # generator
            (windows_chunk, dict(index_by_token__chunk))
            for windows_chunk, index_by_token__chunk in itertools.product(
                grouper(windows, WINDOW_CHUNK_SIZE, incomplete="ignore"),
                grouper(index_by_token.items(), CORPUS_TOKENS_CHUNK_SIZE, incomplete="ignore"),
            )
        )
        tasks = (executor.submit(unpack_count_cooccurrences, params) for params in all_params)  # generator
        return reduce(lambda counter, future_counter: counter + future_counter.result(),  # could not use operator.add
                      concurrent.futures.as_completed(tasks),  # to reduce progressively
                      Counter())  # starting from an initial value

which can be called similarly to the first one:

    duration, cooccurrences = measured_duration(count_cooccurrences_parallel)(windows, token_index_list)
    print(duration, cooccurrences)
13.42153549194336 Counter({(0, 0): 2, (0, 2): 1})

The result is much slower (13 ms vs 0.05 ms) because of the multiprocessing overhead (spawning worker processes and passing the data to them). For a dataset this small, multiprocessing is really overkill, but for your larger dataset it should provide good performance.
Again, let me know how long it takes to run.

Lenormju
  • I implemented it. So far it doesn't work, as my counter is printed empty. But at least it fails quickly. If I add the line Cooccurrences = Counter[Tuple[int, int], int] I get: line 366, in Cooccurrences = Counter[Tuple[int, int], int] TypeError: 'type' object is not subscriptable. If I delete that and the related -> Cooccurrences: after the function, it compiles but the counter is printed empty. I will check whether I missed something, but I wanted to give you that update. Thanks a lot for now already. – bievjucs Feb 11 '22 at 16:28
  • Now it runs through. For some reason it did not let me use your types. Anyway, I will see how long it takes and whether it is faster this way or not. – bievjucs Feb 11 '22 at 16:35
  • One more thing: it should not happen that a token appears more than once in a context window, or only in very rare cases without meaning, for example when the dependency parser made a wrong prediction or the parsed sentence was broken. – bievjucs Feb 11 '22 at 16:40
  • Is it possible with your way to just create a counter for the first ~1000 entries of the tokens? – bievjucs Feb 11 '22 at 16:42
  • @bievjucs yes, with [`itertools.islice`](https://docs.python.org/3/library/itertools.html#itertools.islice). For the two versions, replace `index_by_token.items()` by `itertools.islice(index_by_token.items(), 1000)` or whatever amount you want. – Lenormju Feb 11 '22 at 17:13
  • OK, so for 1000 tokens my original version takes 32098 ms and your code took 171357 ms. I will try the parallelization now, or tomorrow maybe. – bievjucs Feb 11 '22 at 18:16
  • So in the parallelization, I have to replace `index_by_token.items()` with `itertools.islice(index_by_token.items(), 1000)` in both the count_parallel and the count function? I did that, and it took 48762 ms with 1000 tokens, with WINDOW_CHUNK_SIZE = 20k and CORPUS_TOKENS_CHUNK_SIZE = 100. I will play around with that a bit. – bievjucs Feb 11 '22 at 18:40
  • I got it down to roughly 43 seconds with different chunks. And then for 8000 tokens I scaled the token chunk size by 8 times and the time also increased by 8. So it is faster than your code without parallelization, but not faster than my original code. – bievjucs Feb 11 '22 at 19:53
  • @bievjucs then I'm afraid we have reached the limits of a solution in pure Python, either you will have to use other tools (languages or libraries) or use another algorithm for your problem. For example, instead of doing it in Python, maybe a database could do the search faster (using indexes, low-level primitives and such). – Lenormju Feb 13 '22 at 13:44
  • Looking at [this answer](https://stackoverflow.com/a/18672200/11384184) seems to indicate that it may be better to use directly [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html), especially with a very large number of jobs. I may give it a try. – Lenormju Feb 15 '22 at 15:09
  • I got it down from half an hour to 30 seconds using multiprocessing and pure C code via Cython. It was necessary to reduce the list of context chains to a list of lists of indexes, so every index relates to one token in the vocabulary. Then I had to make each context chain the same length. I chose 10 because no context chain is originally longer than 6 or 7. I gave the additional elements a value bigger than the size of the vocabulary. Then I could transform this list of lists of indexes into a 1-dimensional array in C and loop over everything without any Python overhead in the nested loop. – bievjucs Feb 22 '22 at 20:52
  • @bievjucs Nice! Indeed, C is much faster than Python. I initially wanted to stay in the Python realm, so I reached for making things concurrent. I'm still not very proficient in Cython so I did not recommend it, but in your case it seems to fit the need perfectly. Do you consider that you achieved the required performance, or is it still too slow? – Lenormju Feb 23 '22 at 08:06
  • Well, I don't think it's even possible to make it significantly faster. My prof suggested that I can use a server from the university to run the count; in combination it will be fast enough. Let's say I would run it for 20,000,000 sentences and a vocabulary of 700,000 words: with my code improvements it would take 9 days on my old laptop, which I consider good enough. It was quite difficult to get behind Cython at first, but it helps a lot to just look up how things are done in C, because there is more information to find than about Cython. If you are interested I can upload the code here. – bievjucs Feb 23 '22 at 12:13
  • @bievjucs I think you should post it as another answer and mark it as accepted. If you would like to improve the performance of your current Cython code, you may consider posting another question on this site. I can still think of numerous ways to improve the speed (using `numpy` or `scipy` for vectorization, taking the page size into account to create work chunks, using different data structures, ...). – Lenormju Feb 23 '22 at 13:44
  • Accessing numpy arrays is actually slower. Creating chunks will be done once I am working with more data. I am saving the count with h5py. – bievjucs Feb 23 '22 at 14:30

Using multiprocessing and Cython. The first thing that has to be done is to change the list of context chains into a list of lists of indexes. Each list of indexes has to have the same length X, and each index is the index of a word in the vocabulary, so every distinct word in a context chain maps to a distinct index in the vocabulary. For context chains shorter than X, additional indexes have to be added as padding; these padding indexes should all be the same and bigger than the biggest index in the vocabulary. After that is done, Cython and multiprocessing can be used. For the example below I just generate a random list of context chains and assume a vocabulary of size 32000. I have 8 cores on my machine, so I split the vocabulary into 8 slices, but more slices would work too.
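
A minimal sketch of that conversion (illustrative, not my exact code; it assumes token_to_index is the word-to-index vocabulary dict from the question, a padded length of 10 and 50000 as the padding index, matching the code below):

PAD_INDEX = 50000   # any value bigger than the biggest vocabulary index
CHAIN_LEN = 10      # no context chain is longer than 6 or 7, so 10 is enough

def chains_to_index_lists(context_chains, token_to_index):
    indexed_chains = []
    for chain in context_chains:
        # replace every known word by its vocabulary index
        row = [token_to_index[word] for word in chain if word in token_to_index]
        # pad every chain to the same fixed length
        row += [PAD_INDEX] * (CHAIN_LEN - len(row))
        indexed_chains.append(row)
    return indexed_chains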

On the Python side:

This generates the context chains and the vocabulary index list slices, and calls the function dosomething() in parallel:

import concurrent.futures
import random

import helper  # provides current_milli_time()


def runcoocinparallel():
    index_lists = []
    context_list = []
    for ab in range(300000):
        list2 = []
        for ac in range(3):
            list2.append(random.randint(0, 32000-1))
        for ac in range(7):
            list2.append(50000)
        context_list.append(list2)
    for j in range(8):
        indexlist = []
        for i in range(4000):
            indexlist.append(j*4000+i)
        index_lists.append(indexlist)
    time1 = helper.current_milli_time()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        a = executor.map(dosomething, index_lists, [context_list]*8)
    print(helper.current_milli_time()-time1)

In this function the call to the Cython makecount function happens; this wrapper could be eliminated.

def dosomething(index_list, context_list):
    firstindex = index_list[0]
    index_list_len = len(index_list)
    # 8 * index_list_len is the full vocabulary size (32000 in this example)
    array_cy.makecount(firstindex, index_list_len, 8*index_list_len, context_list)

On the Cython side:

# C-level memory management and the directive decorators
from libc.stdlib cimport malloc, free
cimport cython

import helper  # provides current_milli_time()


# the decorators are supposed to speed up the indexing
@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
cpdef makecount(int first_index, int indizes_slice_len, int indizes_len, context_list):
    # type definitions so Cython does not check the types on every loop iteration
    cdef int context_list_len = len(context_list)
    cdef int i
    cdef int j
    cdef int k
    cdef int context_index
    # allocating memory space for the arrays:
    # context_list_cy holds the flattened context chains,
    # combinations stores the count for each possible (focus index, context index) combination
    cdef int *context_list_cy = <int *> malloc(10 * context_list_len * sizeof(int))
    cdef int *combinations = <int *> malloc(indizes_slice_len * indizes_len * sizeof(int))
    # setting the counts to zero
    for i in range(indizes_slice_len * indizes_len):
        combinations[i] = 0
    # creating a 1-dimensional C array from the list of context chains
    for i in range(context_list_len):
        for j in range(10):
            context_list_cy[i*10+j] = context_list[i][j]
    # this is where all the time is spent
    time1 = helper.current_milli_time()
    for i in range(first_index, first_index+indizes_slice_len):
        for j in range(context_list_len):
            # checking if focus_index i is inside this context chain
            if i == context_list_cy[j * 10] or i == context_list_cy[j * 10 + 1] or i == context_list_cy[j * 10 + 2] or \
               i == context_list_cy[j * 10 + 3] or i == context_list_cy[j * 10 + 4] or \
               i == context_list_cy[j * 10 + 5] or i == context_list_cy[j * 10 + 6] or \
               i == context_list_cy[j * 10 + 7] or i == context_list_cy[j * 10 + 8] or i == context_list_cy[j * 10 + 9]:
                # if so, increase the count of every valid context_index
                for k in range(10):
                    context_index = context_list_cy[j*10+k]
                    if not (i == context_index or context_index == 50000):
                        combinations[(i-first_index)*indizes_len+context_index] += 1
    print(helper.current_milli_time()-time1)
    # in the real code the counts would be saved here (e.g. with h5py) before freeing
    free(context_list_cy)
    free(combinations)
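
For completeness, a minimal build script for the extension module (a sketch; it assumes the Cython code above is saved as array_cy.pyx, matching the array_cy.makecount call in dosomething()):

# setup.py -- minimal Cython build script (illustrative)
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("array_cy.pyx", compiler_directives={"language_level": "3"}))

# build in place with:  python setup.py build_ext --inplace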

Note: for bigger context chain lists, RAM size becomes a problem.
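
A rough back-of-the-envelope estimate with the sizes used above (an illustration, assuming 4-byte C ints; note that executor.map also ships a full copy of context_list to every worker):

chains, vocab, slice_len, workers = 300000, 32000, 4000, 8
context_array = 10 * chains * 4          # context_list_cy per worker: ~11 MiB
count_array = slice_len * vocab * 4      # combinations per worker: ~488 MiB
print((context_array + count_array) * workers / 2**30)  # ~3.9 GiB across all 8 workers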

bievjucs