I've been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code. The output could also be stored in memory and written to a file afterwards.
I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't figure out how to adapt it to my situation. I am working on Windows, using Python 3.4.
for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")
If you need an explanation for the code:
- `unique_words` is a list of words (strings)
- `global_map` is a dictionary whose keys are words (`global_map.keys()` contains the same elements as `unique_words`) and whose values are dictionaries of the format {word: value}, where the inner words are a subset of the words in `unique_words`
- for each word, I look for the most similar word based on its value in `global_map`. I would prefer not to store every similarity in memory, since the maps already take up too much
- `calculate_similarity` returns a value from 0 to 1
- the result should contain the most similar word for each of the words in `unique_words` (the most similar word has to be different from the word itself, which is why I added the condition `if not i == j`; this could also be done by checking whether `max_similarity` is different from 1)
- if the `max_similarity` for a word is 0, it's OK for the most similar word to be the empty string
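Based on the multiprocessing examples I've read, here is a rough sketch of what I have in mind: compute the most similar word for each word in parallel with `multiprocessing.Pool`, keep the results in memory, and write them to the file at the end. The sample data, the placeholder `calculate_similarity`, and the output file name are only illustrative, and I'm not sure this is the right way to structure it for my case:

from multiprocessing import Pool

# Hypothetical sample data, only to make the structures concrete;
# the real unique_words and global_map are much larger.
unique_words = ["apple", "banana", "cherry"]
global_map = {
    "apple": {"banana": 0.7, "cherry": 0.1},
    "banana": {"apple": 0.7},
    "cherry": {"apple": 0.1},
}

def calculate_similarity(map_a, map_b):
    # Placeholder standing in for the real function, which returns a value from 0 to 1.
    common = set(map_a) & set(map_b)
    return len(common) / max(len(map_a), len(map_b), 1)

def most_similar(word):
    # Same logic as the inner loop above: find the most similar other word.
    max_similarity = 0
    max_similarity_word = ""
    for other in unique_words:
        if other != word:
            similarity = calculate_similarity(global_map[word], global_map[other])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = other
    return word, max_similarity_word, max_similarity

if __name__ == "__main__":  # the guard is required for multiprocessing on Windows
    with Pool() as pool:
        # One task per word; results stay in memory until all workers are done.
        results = pool.map(most_similar, unique_words)
    with open("co_occurring.txt", "w") as file_co_occurring:
        for word, similar_word, similarity in results:
            file_co_occurring.write(word + "\t" + similar_word + "\t" + str(similarity) + "\n")

(As far as I understand, on Windows each worker process re-imports the module, so `unique_words` and `global_map` would be rebuilt in every worker; I don't know whether that is acceptable for maps of my size.)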