I've been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code. The output could also be stored in memory and written to a file afterwards.

I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't figure out how to implement the same thing for my situation. I am working on a Windows platform, using Python 3.4.

for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")

If you need an explanation for the code:

  • unique_words is a list of words (strings)
  • global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the form {word: value}, where those words are a subset of the words in unique_words
  • for each word, I look for the most similar word based on its value in global_map. I'd prefer not to store every similarity in memory, since the maps already take up too much.
  • calculate_similarity returns a value from 0 to 1
  • the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j; alternatively, I could check whether max_similarity is different from 1)
  • if the max_similarity for a word is 0, it's OK for the most similar word to be the empty string
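For context, the same serial logic can be written more compactly with max() over a generator expression. This is just a sketch using the names above; max(..., default=...) needs Python 3.4+:

for word in unique_words:
    # Pair each candidate's similarity with the candidate itself, so
    # max() returns both; 'default' covers the case of no candidates.
    best_score, best_word = max(
        ((calculate_similarity(global_map[word], global_map[other]), other)
         for other in unique_words if other != word),
        default=(0, ""))
    file_co_occurring.write(word + "\t" + best_word + "\t" + str(best_score) + "\n")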
  • I don't think you can run that exact loop in parallel without storing some similarity values in memory. Can you describe what the expected output of the code is? If you are looking for the most similar word to *each* word in the set, I don't think this does that. – Matt Mar 23 '15 at 18:46
  • Yes, it is supposed to find the most similar word for each word. Should it loop all words in the inner loop too, or the error is somewhere else? Also if you have a solution that stores the results in memory, please post it and I'll find a way to make some space. – van Mar 23 '15 at 19:08
  • 2
    I think the problem is in both loops. The outer loop skips the last word, so you won't find the most find the most similar word to it. The inner loop would also need to see all words. For example, unless I missed something, with your current code and a word list [a,b,c] where b is more similar to a than to c, your output would be: a is most similar to b (the a loop sees b), b is most similar to c (the b loop doesn't see a, so it picks c instead) and there would be no output for c. – Matt Mar 23 '15 at 19:18
  • You are right. I just edited the question better explaining the expected output, I will edit it again now fixing this as you referred. Thanks. – van Mar 23 '15 at 19:20
  • Not sure if this would work in Python 3, but this might be worth looking into: http://stackoverflow.com/questions/2846653/python-multithreading-for-dummies/28463266#28463266 – Ryan Mar 23 '15 at 20:02
  • It would likely be more profitable to find another algorithm that suits your needs, as no amount of parallelization is going to improve the overall speed of an O(N^2) algorithm (with a sufficiently large N). – Makoto Mar 23 '15 at 20:10
  • Do you have some other algorithm in mind? Because I really couldn't think of any. I would be satisfied if the execution time is a few hours, because I need to execute this only once and then work with the stored results. – van Mar 23 '15 at 20:22
  • I tried a different approach to solve this problem, with (I think) a simpler loop to parallelize and I got stuck again, so [here](http://stackoverflow.com/questions/29222986/parallelize-this-nested-for-loop-in-python) is the new question. – van Mar 24 '15 at 00:41

1 Answer


Here is a solution that should work for you. I ended up changing a lot of your code so please ask if you have any questions.

This is far from the only way to accomplish this, and in particular it is not a memory-efficient solution.

You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
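If you'd rather derive max_workers from the machine than hard-code it, one option is os.cpu_count(), available since Python 3.4 (a small sketch, not part of the original answer):

import os

# Start with one worker per logical CPU; fall back to a fixed
# default if the count cannot be determined.
max_workers = os.cpu_count() or 4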

from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = defaultdict(list)
    # submit one task per ordered pair of distinct words
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executor.submit(new_calculate_similarity, word1, word2))

    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)

Here are the docs for the libraries I used:

  • concurrent.futures
  • itertools.permutations
  • collections (namedtuple and defaultdict)
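One caveat: calculate_similarity is CPU-bound, and CPython's GIL means threads won't actually run it in parallel. A ProcessPoolExecutor variant may perform better. Below is a minimal sketch, assuming calculate_similarity and global_map are defined at module level in an importable module (which multiprocessing on Windows requires); it is not guaranteed to be faster for your data:

from concurrent.futures import ProcessPoolExecutor
from itertools import permutations
from collections import defaultdict

def similarity_task(word1, word2):
    # Runs in a worker process; looks up both words in the module-level
    # global_map, so the map must exist when the worker imports this module.
    return calculate_similarity(global_map[word1], global_map[word2]), word2

if __name__ == '__main__':  # guard is required for multiprocessing on Windows
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = defaultdict(list)
        for word1, word2 in permutations(unique_words, r=2):
            futures[word1].append(
                executor.submit(similarity_task, word1, word2))
        for word, word_futures in futures.items():
            # max over (value, word) tuples picks the highest similarity
            value, match = max(f.result() for f in word_futures)
            print(word, match, value, sep='\t', file=file_co_occurring)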

  • Thanks for the solution offered. I tested it for a small input, and the execution time was worse. I've read that this can be normal if the input is not that big, right? – van Mar 23 '15 at 20:57
  • @nicck, yes, it can be slower for small inputs, as there is some overhead to set up the thread pool and also overhead to move data between threads. It's hard to say whether or not the speed will improve for larger inputs. – Matt Mar 23 '15 at 21:05
  • @nicck, I forgot about the GIL; that might be part of the problem here. You could try a ProcessPoolExecutor instead of the ThreadPoolExecutor. It may or may not work though. – Matt Mar 23 '15 at 21:52
  • there's a bug in the code but I really couldn't figure out where. It outputs a similarity of 1 for completely different words, and the error is definitely not in the `calculate_similarity` function, so it may be in the parameters given to it. I tried a different approach to solve this problem, with (I think) a simpler loop to parallelize and I got stuck again, so [here](http://stackoverflow.com/questions/29222986/parallelize-this-nested-for-loop-in-python) is the new question. – van Mar 24 '15 at 00:41
  • @nicck, is global_map set up before this starts, or is it updated while it is running? – Matt Mar 24 '15 at 01:09
  • it is calculated before this – van Mar 24 '15 at 01:13