
I have several .txt files, each with about a million lines, and it takes about a minute to search one of them for equality matches. The files are saved as 0.txt, 1.txt, 2.txt, ... for convenience; in_1 and searchType are user-given inputs.

class ResearchManager():
    def __init__(self, searchType, in_1, file):
        self.file = file
        self.searchType = searchType
        self.in_1 = in_1

    def Search(self):
        current_db = open(str(self.file) + ".txt", 'r')
        ...

        # Current file processing


if __name__ == '__main__':

    n_file = 35
    for number in range(n_file):
        RM = ResearchManager(input_n, input_1, number)
        RM.Search()

I would like to optimise the search process using multiprocessing, but I have not succeeded. Is there any way of doing this? Thank you.
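
Something along these lines is what I am aiming for (a minimal sketch built around multiprocessing.Pool; search_file and the hard-coded inputs are placeholders standing in for the actual Search logic):

import multiprocessing

# Placeholder standing in for the real Search(): scan one numbered file for matching lines.
def search_file(task):
    searchType, in_1, number = task
    matches = []
    with open(str(number) + ".txt", "r") as current_db:
        for line in current_db:
            if line.strip() == in_1:  # stand-in for the real comparison
                matches.append(line)
    return matches

if __name__ == "__main__":
    searchType = "equality"      # user-given inputs in the real program
    in_1 = "value to find"
    n_file = 35
    tasks = [(searchType, in_1, number) for number in range(n_file)]
    with multiprocessing.Pool() as pool:
        results = pool.map(search_file, tasks)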

Edit.

I was able to use threads in this way:

class ResearchManager(threading.Thread):
    def __init__(self, searchType, in_1, file):
        threading.Thread.__init__(self)
        self.file = file
        self.searchType = searchType
        self.in_1 = in_1

    def run(self):
        current_db = open(str(self.file) + ".txt", 'r')
        ...

        # Current file processing

...

threads = []
for number in range(n_file + 1):
    threads.append(ResearchManager(input_n, input_1, number))

start = time.time()

for t in threads:
    t.start()

for t in threads:
    t.join()

end = time.time()
But the total execution time is even a few seconds longer than with the plain for loop.

  • You may first implement the code with ThreadPoolExecutor and change to ProcessPoolExecutor later. If any error is raised in the transition, it is likely due to pickling, and a refactor is needed. Make sure that the task and arguments submitted to ProcessPoolExecutor are all picklable; avoid file objects, lambdas/nested functions, etc. – Aaron Apr 11 '21 at 14:15
  • I tried to adapt what was said [here](https://stackoverflow.com/questions/20190668/multiprocessing-a-for-loop). Thanks for the suggestions, I'll have a look. – LucaT3X Apr 11 '21 at 14:38
  • [`multiprocessing.dummy.ThreadPool`](https://docs.python.org/3.9/library/multiprocessing.html?highlight=multiprocessing#module-multiprocessing.dummy) is a drop-in thread-based replacement for `multiprocessing.Pool` (see the sketch after these comments). – Aaron Apr 11 '21 at 14:48
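
A minimal sketch of that drop-in swap (search_file and the hard-coded search value are hypothetical placeholders for the actual search logic; switching between threads and processes is just the commented import line):

from multiprocessing.dummy import Pool   # thread-based pool
# from multiprocessing import Pool       # process-based pool with the same interface

# Hypothetical stand-in for the real search over one numbered file.
def search_file(number):
    with open(str(number) + ".txt", "r") as current_db:
        return [line for line in current_db if line.strip() == "value to find"]

if __name__ == "__main__":
    # Only plain, picklable values (the file numbers) are submitted to the pool.
    with Pool() as pool:
        results = pool.map(search_file, range(35))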

1 Answer


Can you show what you have tried in terms of threading? Take a look at this article; it does a good job of providing a basic understanding of how Python threads work.

https://realpython.com/intro-to-python-threading/

import logging
import threading
import time

def thread_function(name):
    logging.info("Thread %s: starting", name)
    time.sleep(2)
    logging.info("Thread %s: finishing", name)

if __name__ == "__main__":
    format = "%(asctime)s: %(message)s"
    logging.basicConfig(format=format, level=logging.INFO,
                        datefmt="%H:%M:%S")

    threads = list()
    for index in range(3):
        logging.info("Main    : create and start thread %d.", index)
        x = threading.Thread(target=thread_function, args=(index,))
        threads.append(x)
        x.start()

    for index, thread in enumerate(threads):
        logging.info("Main    : before joining thread %d.", index)
        thread.join()
        logging.info("Main    : thread %d done", index)