
I have to make several comparisons between strings (all vs. all). For this reason I created nested for loops like the ones below:

avg_score = []
strings = get_strings()
n = len(strings)
for i in range(n):
    count = 0
    score = 0
    for j in range(n):
        if i == j:
            continue
        score += compare(strings[i], strings[j])
        count += 1
    avg_score.append(score / count)
    do_other_stuff()

strings is a list that contains all the text strings to be compared. avg_score holds the average score of string i vs. all the others. compare is an external function that uses an NLP algorithm to compare two strings and returns a score (approx. 0.25 s per comparison).

I would like to do this comparison in parallel, since it should be faster.

I tried the parfor library and joblib, but I do not understand how to parallelize these for loops. Probably I have to parallelize the j loop instead of both, but I have problems passing the arguments i and strings to the function.

alexmark

3 Answers


Your code generates each pair twice, since both for loops go through every item of the list.

To avoid the duplicate comparisons you can use itertools:

import itertools

score = 0
for a, b in itertools.combinations(strings, 2):
    score += compare(a, b)

Or you can modify your loop to the following:

for i in range(n):
    count = 0
    score = 0
    for j in range(i + 1, n):
        score += compare(strings[i], strings[j])

This way your total computation time should be cut roughly in half, since there are no duplicate iterations.
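
Note that the halved loop only yields a single global total. If compare is symmetric, i.e. compare(a, b) == compare(b, a) (an assumption; check your NLP scorer), you can still recover the per-string averages by crediting each pairwise score to both strings, along these lines:

from itertools import combinations

n = len(strings)
totals = [0.0] * n
for i, j in combinations(range(n), 2):
    s = compare(strings[i], strings[j])  # computed once per pair
    totals[i] += s                       # credited to string i ...
    totals[j] += s                       # ... and to string j
avg_score = [t / (n - 1) for t in totals]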

Atharva Gundawar
  • `if i == j` is unnecessary here; it will always be `False`. – Guy Jun 24 '21 at 10:07
  • That's not what the OP wants: an average score for each string, not one global score. And even then the result would be wrong, since you leave out about half of the scores. – Timus Jun 24 '21 at 10:12
  • 1
    then keep a list of scores for `string[i]` and add the score for `i-j` comparison to 2 scores, calculate the average after all scores are tallied, if the `compare` function is commutative – rioV8 Jun 24 '21 at 10:14
  • @rioV8 why should one assume that `compare` is commutative? Isn't the easiest way for the OP to map the problem to [this](https://stackoverflow.com/a/20888445/3896008) dead-simple multiprocessing example in Python? One just needs to generate all pairs of `strings` and create a list of tuples, thus reducing it to the standard multiprocessing example – lifezbeautiful Jun 24 '21 at 10:20
  • first of all, thanks for your reply... as already said, unfortunately I cannot simply reduce the number of calculations by shrinking the for loop. I need the whole comparison (1 vs. all) and then take the maximum... – alexmark Jun 24 '21 at 10:28

One simple approach would be something like:

from multiprocessing import Pool

strings = get_strings()  # at module level, so the worker processes can see it

def comp_score(i):
    s = strings[i]
    return sum(compare(s, r) for r in strings[:i] + strings[i+1:]) / (len(strings) - 1)

if __name__ == '__main__':
    with Pool(8) as p:
        avg_scores = p.map(comp_score, range(len(strings)))  # map already returns a list

Whether that works for you depends a bit on what do_other_stuff entails.
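
As a quick sanity check of the Pool approach, a stand-in compare that sleeps can mimic the ~0.25 s cost per call (compare and strings below are placeholders for the real ones):

import random
import time
from multiprocessing import Pool

strings = [str(k) for k in range(100)]  # placeholder for get_strings()

def compare(a, b):  # placeholder for the real NLP comparison
    time.sleep(0.01)
    return random.random()

def comp_score(i):
    s = strings[i]
    return sum(compare(s, r) for r in strings[:i] + strings[i+1:]) / (len(strings) - 1)

if __name__ == '__main__':
    start = time.time()
    with Pool(8) as p:
        avg_scores = p.map(comp_score, range(len(strings)))
    print(f'{time.time() - start:.1f} s for {len(strings)} strings')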

EDIT: Regarding your comments, here's a version of how it could look. It doesn't include the right computations, since I can't see how count is actually computed, and I also don't know what the dictionary looks like. So it's more of a "this is how you could do it in principle" answer.

from multiprocessing import Pool

def comp_score(i):
    avg_score, count = 1, 2  # Placeholder for calculation of avg_score and count
    return avg_score, count

if __name__ == '__main__':
    strings = get_strings()
    results = {'avg_scores': [], 'counts': []}
    with Pool(8) as p:
        for score, count in p.map(comp_score, range(len(strings))): 
            results['avg_scores'].append(score)
            results['counts'].append(count)
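
If the final structure is a dictionary keyed by topic (see the comments below), the aggregation still has to happen in this parent loop. A hypothetical sketch, assuming comp_score is extended to also return the topic of string i:

# hypothetical: comp_score returns (topic, avg_score, count) per string
results = {}
with Pool(8) as p:
    for topic, score, count in p.map(comp_score, range(len(strings))):
        entry = results.setdefault(topic, {'avg_scores': [], 'counts': []})
        entry['avg_scores'].append(score)
        entry['counts'].append(count)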
Timus
  • do_other_stuff is a function that appends count and avg_score to a dictionary according to some parameters. Could the return of comp_score be the dictionary with avg_score and count appended? – alexmark Jun 24 '21 at 10:36
  • @alexmark Unfortunately that won't work. The collection of the results has to happen in the controlling process (the one that builds the pool) because you can't easily modify one single object across different processes. What exactly is the purpose of the `count` variable? – Timus Jun 24 '21 at 11:20
  • it counts how many similarities there are between i and all the j strings... basically, consider that each string has a topic; for each topic I check the similarities of the strings. For this reason, at the end of the i-j loops, for each topic count is a vector like [25, 30, 50], meaning that the n-th topic has 25, 30 and 50 similar strings. Then I take the maximum (or maxima) and look up the corresponding score value in the avg_score array – alexmark Jun 24 '21 at 12:22

Here is an example using multi-threading. Inside the run function you can do whatever you want in a multi-threaded fashion.

import threading

class my_thread(threading.Thread):
    def __init__(self, strings, indexes, thread_id=0):
        super().__init__()
        self.strings = strings
        self.indexes = indexes
        self.avg_score = []
        self.id = thread_id
    
    def run(self):
        print(f'Thread {self.id} working: from {self.indexes[0]} to {self.indexes[-1]}')
        for i in self.indexes:
            if i >= len(self.strings):
                break
            count = 0
            score = 0
            for j in range(len(self.strings)):
                if i == j:
                    continue
                score += compare(self.strings[i], self.strings[j])
                count += 1
            self.avg_score.append(score/count)
            # do other stuff
        print(f'Thread {self.id} done')

avg_score = []
strings = get_strings()
n = len(strings)

# Set the number of threads
num_threads = 10

threads = []
chunk = n // num_threads
for ii in range(num_threads):
    # the last thread takes the remainder when n is not divisible by num_threads
    start = chunk * ii
    end = chunk * (ii + 1) if ii < num_threads - 1 else n
    threads.append(my_thread(strings, range(start, end), thread_id=ii))
    threads[-1].start()

for t in threads:
    t.join()
    avg_score.extend(t.avg_score)
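
If, as discussed in the comments below, each thread should collect a per-topic dictionary rather than a flat list, the same join loop can merge the threads' results afterwards. A hypothetical sketch, assuming each thread exposes a results attribute of the form {topic: {'count': [...], 'avg_score': [...]}}:

merged = {}
for t in threads:
    t.join()
    for topic, data in t.results.items():  # t.results is hypothetical
        entry = merged.setdefault(topic, {'count': [], 'avg_score': []})
        entry['count'].extend(data['count'])
        entry['avg_score'].extend(data['avg_score'])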

Here is a reproducible example:

import threading
import time
import random
import string

class my_thread(threading.Thread):
    def __init__(self, strings, indexes, thread_id=0):
        super().__init__()
        self.strings = strings
        self.indexes = indexes
        self.avg_score = []
        self.id = thread_id
    
    def run(self):
        print(f'Thread {self.id} working: from {self.indexes[0]} to {self.indexes[-1]}')
        for i in self.indexes:
            if i >= len(self.strings):
                break
            count = 0
            score = 0
            for j in range(len(self.strings)):
                if i == j:
                    continue
                score += compare(self.strings[i], self.strings[j])
                count += 1
            self.avg_score.append(score/count)
            # do other stuff
        print(f'Thread {self.id} done')


def compare(str1, str2):
    time.sleep(0.01)
    return random.random()


strings = [''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for _ in range(100)]
n = len(strings)


# --- MULTI-THREADING ---

avg_score = []

# Set the number of threads
num_threads = 10

start_time = time.time()

threads = []
chunk = n // num_threads
for ii in range(num_threads):
    # the last thread takes the remainder when n is not divisible by num_threads
    start = chunk * ii
    end = chunk * (ii + 1) if ii < num_threads - 1 else n
    threads.append(my_thread(strings, range(start, end), thread_id=ii))
    threads[-1].start()

for t in threads:
    t.join()
    avg_score.extend(t.avg_score)

print(f'Elapsed time with multi-threading: {time.time() - start_time} s')


# --- SINGLE THREAD ---

avg_score = []

start_time = time.time()

for i in range(n):
    count = 0
    score = 0
    for j in range(n):
        if i == j:
            continue
        score += compare(strings[i], strings[j])
        count += 1
    avg_score.append(score / count)
    # do other stuff

print(f'Elapsed time with no threads: {time.time() - start_time} s')
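
For comparison, the same fan-out can also be written with concurrent.futures from the standard library, which handles the chunking and joining for you; a sketch reusing the compare, strings and n defined above:

from concurrent.futures import ThreadPoolExecutor

def avg_for(i):
    others = strings[:i] + strings[i+1:]
    return sum(compare(strings[i], o) for o in others) / len(others)

with ThreadPoolExecutor(max_workers=10) as ex:
    avg_score = list(ex.map(avg_for, range(n)))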
ALai
  • And: I couldn't understand why threading would be helpful here, since the tasks are computationally heavy and not I/O bound. So I removed the `time.sleep(0.01)` from your compare function - which is kind of a release of processing resources that won't happen in the case at hand - and the speed advantage mostly vanishes. Are you sure your approach makes sense? – Timus Jun 24 '21 at 12:02
  • Your first comment is right, I've edited it. – ALai Jun 24 '21 at 12:12
  • Concerning your second comment, we don't actually know what `do_other_stuff` will be (actually, we don't know `compare` either). Nevertheless, a multi-threading approach could benefit from physical parallelization if multiple cores are available. Trying it out with 10,000 strings, removing `time.sleep`, 5 tasks (6-core CPU): `29.2 s` vs `39.1 s`. That's a 25% speedup – ALai Jun 24 '21 at 12:40
  • Basically, compare uses NLP to compare string similarity, giving back a score (around 0.25 sec per pair of strings). do_other_stuff is a function that counts how many similarities exist between i and all the j strings... basically, consider that each string has a topic; for each topic I check the similarities of the strings. For this reason, at the end of the i-j loops, for each topic count is a vector like [25, 30, 50], meaning that the n-th topic has 25, 30 and 50 similar strings. Then I take the maximum (or maxima) and look up the corresponding score value in the avg_score array – alexmark Jun 24 '21 at 12:52
  • @ALai You're right, my take on `do_other_stuff` and `compare` was pure conjecture. I still guess that multiprocessing would be a better fit. But I think it comes down to @alexmark actually comparing approaches. – Timus Jun 24 '21 at 12:57
  • @ALai, I'm trying to modify the code to get, at the end, a dictionary of the form {topic: {count: [], avg_score: []}}. One question... if I do the final extend, I will not be able to aggregate data from different threads, right? It means that if the same topic is found in two different threads, the final aggregated dictionary will show the topic twice. Hope my explanation is clear – alexmark Jun 24 '21 at 13:18
  • @alexmark after having joined all the threads (`t.join()` in the for loop), you can access any attribute of the threads from the outside using `threads[index].attribute_name`. In this case, it is up to you to correctly merge the results – ALai Jun 24 '21 at 13:51