I am trying to make a co_occurrence count work fast in python.
I have a list of context windows. Which is a list of words in some way connected by a dependency label after I parsed the corpus with a transition parser. Each context window has in average length 3. So a context window is not simply the 10 words before and after the focus word. That is why I need a list of lists. And why I can't just use a list of words.
I also have a dictionary where every distinct word of the corpus has a distinct index as value. But I don't think that using that speeds it up a lot. Right now I am only working with a small test corpus with 50,000 words and about 300,000 context windows my code is fast enough for that but I need it to be a lot faster because I need it to work on a much larger corpus in the end.
class CoOc:
def __init__(self):
self.co_occurrence = Counter()
def add_count(self, token_index, context, token_to_index):
try:
# some words are not in the token dictionary if they dont appear frequent enough
context_index = token_to_index[context]
except KeyError:
context_index = -1
if context_index != -1:
self.co_occurrence[(token_index, context_index)] += 1
def count_co_occurrence(self, context_chains, token_index_dic):
token_index_list = []
for key in token_index_dic:
token_index_list.append([key, token_index_dic[key]])
time1 = current_milli_time()
for token_key in token_index_list[0:100]:
token = token_key[0]
token_index = token_key[1]
for context_chain in context_chains:
if token in context_chain:
for context_word in context_chain:
if token != context_word:
self.add_count(token_index, context_word, token_index_dic)
time2 = current_milli_time()
print(time2-time1)
So in the count function from all the tokens, which are stored in the token_index dictionary, I make a list containing only a small section of it to check the time it takes.
I tried to loop through the list of all the context windows to create a list of context windows only consisting of indices before I loop through the tokens so I would not even need the add count function and just replace the line where I call add_count with the last line of add_count but that did not speed up the process.
Like this it takes 3 seconds to run, for only 100 focus words. I need it to work a lot faster to be working on the corpus I want to use it on. Is that even possible in python or do I need to implement it in another language?
So the first 10 context windows look like this:
['Gerücht', 'drücken', 'Kurs']
['Konkursgerüchte', 'drücken', 'Kurs']
['Kurs', 'Aktie']
['Kurs', 'Amazon-Aktie']
['der', 'Amazon-Aktie']
['der', 'Aktie']
['Begleitet', 'Gerücht', 'Konkurs']
['Begleitet', 'Marktgerüchten', 'Konkurs']
['begleiten', 'Gerücht', 'Konkurs']
['begleiten', 'setzt fort', 'Aktie', 'Talfahrt']
Each context word appears in the token_index_dic with some index unless it doesn't appear frequent enough in the corpus.
The first 10 elements of the token_index_list look like this:
['Browser-gestützte', 0]
['Akzeptanzstellen', 1]
['Nachschubplanung', 2]
['persönlichen', 3]
['Unionsfraktion', 4]
['Wired', 21122]
['Hauptfigur', 6]
['Strafgesetz', 7]
['Computer-Hersteller', 8]
['Rückschläge', 9]
and then when I print self.co_occurrence it looks like this:
# (focus_word_index, context_word_index): count
Counter({(1, 9479): 11, (1, 20316): 7, (2, 1722): 7, (2, 20217): 7, (2, 19842): 7, (2, 2934): 7, (3, 11959): 7, (3, 2404): 7, (3, 1105): 7, (4, 12047): 7, (4, 19262): 7, (0, 13585): 4, (1, 18525): 4, (1, 1538): 4, (1, 9230): 4, (1, 20606): 4, (1, 1486): 4, (2, 13166): 4, (2, 6948): 4, (2, 23028): 4, (2, 14051): 4, (3, 3854): 4, (3, 7908): 4, (3, 4902): 4, (3, 13222): 4, (4, 23737): 4, (4, 6785): 4, (4, 18718): 4, (5, 15424): 4, (5, 4394): 4, (5, 21534): 4, (5, 5829): 4, (5, 6513): 4, (6, 23331): 4, (6, 7234): 4, (6, 20606): 4, (6, 22951): 4, (6, 7318): 4, (6, 15332): 4, (9, 21183): 4, (9, 23028): 4, (9, 1572): 4, (1, 25031): 3, (1, 5829): 3, (1, 14458): 3, (3, 14387): 3, (3, 9574): 3, (3, 21061): 3, (4, 18143): 3, (4, 3306): 3, (4, 17798): 3, (4, 2250): 3, (5, 9982): 3, (5, 5999): 3, (6, 15727): 3, (0, 16008): 2, (0, 1304): 2, (0, 5210): 2, (0, 17798): 2, (1, 20000): 2, (1, 19326): 2, (1, 3820): 2, (1, 25173): 2, (1, 21843): 2, (2, 20662): 2, (3, 7521): 2, (3, 14479): 2, (3, 8109): 2, (3, 18311): 2, (4, 2556): 2, (5, 23940): 2, (5, 1823): 2, (5, 18733): 2, (6, 3131): 2, (6, 947): 2, (6, 18540): 2, (6, 4756): 2, (6, 2743): 2, (6, 7692): 2, (6, 20263): 2, (6, 8670): 2, (6, 2674): 2, (6, 20050): 2, (6, 13274): 2, (6, 17876): 2, (6, 7538): 2, (6, 11098): 2, (6, 4296): 2, (6, 2417): 2, (6, 21421): 2, (6, 19256): 2, (6, 16739): 2, (7, 10908): 2, (7, 4439): 2, (7, 9492): 2, (8, 7027): 2, (8, 599): 2, (8, 4439): 2, (9, 16030): 2, (9, 6856): 2, (9, 24320): 2, (9, 15978): 2, (1, 6454): 1, (1, 14482): 1, (1, 2643): 1, (1, 7338): 1, (2, 21061): 1, (2, 1486): 1, (4, 4296): 1, (4, 23940): 1, (4, 5775): 1, (5, 24133): 1, (5, 2743): 1, (5, 11472): 1, (5, 19336): 1, (5, 20606): 1, (5, 2740): 1, (5, 9479): 1, (5, 14200): 1, (6, 22175): 1, (6, 13104): 1, (6, 10435): 1, (6, 1891): 1, (6, 22353): 1, (6, 4852): 1, (6, 20943): 1, (6, 23965): 1, (6, 13494): 1, (7, 1300): 1, (7, 12497): 1, (7, 2788): 1, (8, 13879): 1, (8, 2404): 1})