
I'm trying to read a 400 MB text file in chunks so I can extract the words from it. I'm using a thread pool, but it runs longer than the single-process version did.

Below are the two functions:

from itertools import islice
from multiprocessing.pool import ThreadPool
import re

def process_text(self):  # processing the text file
    # if the path is valid, enter the try block
    try:
        with open(self.path, 'r') as f:
            pool = ThreadPool(20)  # creating a thread pool of size 20
            print("Processing text file ...")
            while True:
                data = list(islice(f, 100))  # slicing the input file into chunks of 100 lines
                if not data:
                    break
                # word_counting extracts the words and stores them in a dictionary
                pool.map(self.word_counting, data)
            # close/join must sit outside the loop; closing the pool inside it
            # would make the next pool.map call fail with "Pool not running"
            pool.close()
            pool.join()
    except IOError as e:
        print("Could not open {}: {}".format(self.path, e))


def word_counting(self, line):
    # pool.map calls this once per line, so it receives a single string
    # rather than a list of lines
    for word in re.findall(r'\w{2,}', line):  # match words of length greater than 1, i.e. >= 2
        self.word_dic[word] = self.word_dic.get(word, 0) + 1
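
For comparison, a single-process version of the same count can be as simple as the sketch below (an illustration using collections.Counter, not necessarily the exact baseline that was timed):

import re
from collections import Counter

def count_words(path):
    # Count words of length >= 2 across the whole file in one process.
    counts = Counter()
    with open(path, 'r') as f:
        for line in f:
            counts.update(re.findall(r'\w{2,}', line))
    return counts

In CPython, Counter.update tallies the matches in C, so this loop also avoids the per-word dictionary round-trips that word_counting does.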

Can anyone help with this?

  • Perhaps a rough dupe of [this](https://stackoverflow.com/questions/18114285/what-are-the-differences-between-the-threading-and-multiprocessing-modules)? You need multiprocessing for CPU-intensive tasks. – Carcigenicate Apr 21 '20 at 23:41
  • @Carcigenicate thanks for your reply. I checked the link. In my case, I'm trying to multiprocess the `word_counting` function, but I'm not seeing any improvement there. I have tried different pool sizes as well, such as 4, 8, and 12, and big numbers like 100, but I still don't see any improvement. – Master Shifu Apr 22 '20 at 01:17
  • But are you using a `ThreadPool` in your tests? – Carcigenicate Apr 22 '20 at 01:29
  • Yeah, I'm using `ThreadPool`: `pool.map` to apply the function to each chunk of data, `pool.close` to stop accepting further jobs, and `pool.join` to wait for all the threads to finish. – Master Shifu Apr 22 '20 at 03:59
  • A `ThreadPool` uses threads, not processes (even though it lives in the multiprocessing package). You want the other pool in that package. – Carcigenicate Apr 22 '20 at 12:27
  • @Carcigenicate: thanks, I get what you're trying to say. I have made the changes accordingly, but it still doesn't seem to improve performance. I have taken the following points into consideration: 1. I'm using the multiprocessing library for CPU-bound jobs, which mine is. 2. No common resource is shared among the jobs. Do you have any other input on how I can optimize this? – Master Shifu Apr 23 '20 at 04:04
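
Following up on the thread-versus-process discussion above: one detail worth checking is that with a multiprocessing pool, each worker runs in a separate process with its own copy of the object, so updates a worker makes to self.word_dic never reach the parent. The workers have to return their counts so the parent can merge them. Below is a minimal sketch of that pattern; the chunk size, the input.txt path, and the use of collections.Counter are illustrative assumptions, not details from the question.

import re
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(lines):
    # Runs in a worker process: count words of length >= 2 in one chunk.
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r'\w{2,}', line))
    return counts

def read_chunks(path, n=10000):
    # Yield the file as lists of n lines; large chunks amortize the
    # cost of pickling work and results between processes.
    with open(path, 'r') as f:
        while True:
            chunk = list(islice(f, n))
            if not chunk:
                return
            yield chunk

if __name__ == '__main__':
    totals = Counter()
    with Pool() as pool:  # defaults to one worker per CPU core
        # imap_unordered streams chunks to the workers and hands back
        # each partial result as soon as it is ready
        for partial in pool.imap_unordered(count_chunk, read_chunks('input.txt')):
            totals.update(partial)
    print(totals.most_common(10))

If a version like this still shows no speedup, the run is probably dominated by disk I/O or by pickling rather than by the regex work; timing count_chunk in isolation against the end-to-end run would show where the time goes.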
