I need some help with the code below.

GOAL:

I have a huge ".txt" file and need a script to search for a string given as input by the user.

In this specific code I'm trying to speed things up by creating multiple threads (8, to be exact) and assigning a slice of the ".txt" file to each thread, so each thread searches only a portion of the file.

Since I need the return value of the search function that I pass to the threads, I used the "ThreadWithReturnValue" class shown in this question:

How to get the return value from a thread in python?

PROBLEM:

The problem is that I only get the result if it is located in the FIRST thread. The results from the other threads are always empty lists.

If I add logging inside the search function, the log messages appear in the terminal for the first thread only, while those for the other threads do not show up. This makes me believe that the other threads are not even running, even though the thread list shows the correct number of created threads.

HELP: I'm new to Python and have very little experience with threads. I'm most likely missing something basic here. If you can help me I would be immensely grateful.

NOTE: I would like to make the current code work; I know there are other options to accomplish the same task, but I would like to make this one work in the first place.

#! python3
#Search string in huge ".txt" file

import pprint
from threading import Thread

#Search function that takes arguments for searched value, file object, start line and end line of search
def searchFile(value, fileObj, linestart, lineend):
    results = []
    textToSearch = fileObj.readlines()[linestart:lineend]
    for line in textToSearch:
        if value in line:
            results.append(line[0:-1])
    return results

#Thread class whose join() returns the target function's return value
class ThreadWithReturnValue(Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs=None):
        #kwargs=None avoids a shared mutable default; Thread.__init__ treats None as {}
        Thread.__init__(self, group, target, name, args, kwargs)
        self._return = None
    def run(self):
        print(type(self._target))
        if self._target is not None:
            self._return = self._target(*self._args,
                                                **self._kwargs)
    def join(self, *args):
        Thread.join(self, *args)
        return self._return

#File variables
fileToSearch = 'somefile.txt'
totalLines = 1000000 #Let's suppose the file has 1 million lines
sizeThread = totalLines // 8   #Each thread will cover 1/8 of the file's lines

#Main program loop
while True:    
    
    #Ask user for input
    userinput = input('Enter here your input: \n')
     
    #Open file
    SearchFileObj = open(fileToSearch)
    
    #Create empty search results list
    searchResults = []
    
    #Create empty Threads list
    searchThreads = []
    
    #Create 8 Threads

    for i in range(0, totalLines, sizeThread):
        ThreadLineStart = i
        ThreadLineEnd = i + sizeThread
        
        #Create Thread Objects
        threadObj = ThreadWithReturnValue(target=searchFile, args=[userinput, SearchFileObj, ThreadLineStart, ThreadLineEnd])
        searchThreads.append(threadObj)
        threadObj.start()
    
    #Wait for threads to finish
    for thread in searchThreads:
        threadResult = thread.join()
        #Verify thread result and append result (if any) to list
        if threadResult == []:
            continue
        else:
            searchResults.append(threadResult)
    
    #Close file
    SearchFileObj.close()
    
    #Give output to user
    if searchResults == []:
        print('No results found.')
    else:
        print('The results are:')
        pprint.pprint(searchResults)
    
    break
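
One possible explanation for the empty lists, offered as a sketch and not a confirmed diagnosis: every thread receives the same file object, and a file object has a single read position. The first `readlines()` call consumes the whole file, so any later call on the same object starts at end-of-file and returns an empty list before the slice is even applied. The example below uses `io.StringIO` as a stand-in for the opened file (an assumption, so that no file on disk is needed).

```python
import io

# Stand-in for the shared file object (assumption: io.StringIO behaves like
# an opened text file with respect to its read position).
shared = io.StringIO("alpha\nbeta\ngamma\ndelta\n")

# First call reads everything, then slices the resulting list.
first = shared.readlines()[0:2]

# Second call starts at end-of-file, so there is nothing left to slice.
second = shared.readlines()[2:4]

print(first)   # ['alpha\n', 'beta\n']
print(second)  # []
```

If this is what happens in the script, the other threads do run; they simply have nothing left to read.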
anmattos
  • Note that due to the [GIL](https://en.wikipedia.org/wiki/Global_interpreter_lock), threads only speed up I/O bound problems. Since your I/O is from one *single* file, you cannot speed it up via concurrency. (That's even before some programming errors that mean your threads just *duplicate* the significant work instead of splitting it.) While we can help debug the incorrect result, using threads is inherently a red herring for your real problem. – MisterMiyagi May 18 '21 at 12:42
  • Also, [Thread.join()](https://docs.python.org/3/library/threading.html#threading.Thread.join) method does not return the return value of `target` function. – Attila Viniczai May 18 '21 at 12:48
  • @MisterMiyagi thank you. This matches the longer run time I'm getting compared with a non-threaded version of the script. Nonetheless I would like to find a solution for educational purposes. – anmattos May 18 '21 at 17:01
  • @AttilaViniczai thank you. I'm aware of that. This is why I'm using a specific class for the thread whose join() actually returns the function's return value. – anmattos May 18 '21 at 17:02
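
For educational purposes, here is a minimal sketch of how the threaded approach could be made to return results from every slice, assuming the cause is the shared file object. Each thread opens its own handle and walks to its slice with `itertools.islice`. The names `search_file` and `threaded_search`, the simplified `ThreadWithReturnValue` constructor, and the temporary demo file are all hypothetical and not taken from the question. Unlike the original, matches are flattened into one list with `extend`. As the first comment points out, this is unlikely to be faster than a single-threaded scan, since each thread still has to read past the lines before its slice.

```python
import os
import tempfile
from itertools import islice
from threading import Thread


class ThreadWithReturnValue(Thread):
    """Thread whose join() returns the target's return value."""

    def __init__(self, target, args=()):
        super().__init__(target=target, args=args)
        self._return = None

    def run(self):
        # _target, _args and _kwargs are private CPython attributes of Thread.
        self._return = self._target(*self._args, **self._kwargs)

    def join(self, timeout=None):
        super().join(timeout)
        return self._return


def search_file(value, path, line_start, line_end):
    # Open a private handle so this thread has its own read position.
    with open(path) as handle:
        return [line.rstrip("\n")
                for line in islice(handle, line_start, line_end)
                if value in line]


def threaded_search(value, path, total_lines, n_threads=8):
    # Ceiling division so the slices cover every line; at least 1 per slice.
    chunk = max(1, -(-total_lines // n_threads))
    threads = []
    for start in range(0, total_lines, chunk):
        thread = ThreadWithReturnValue(
            target=search_file, args=(value, path, start, start + chunk))
        thread.start()
        threads.append(thread)

    results = []
    for thread in threads:
        results.extend(thread.join())  # join() returns that slice's matches
    return results


# Demo on a small temporary file (hypothetical data: "line 0" .. "line 99").
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(100))
    demo_path = tmp.name

matches = threaded_search("line 9", demo_path, total_lines=100)
os.remove(demo_path)
print(matches)
```

In the demo, matches come from both the first slice and the last one, which is the behaviour the question is after.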
