
I'm trying to use Python's multiprocessing library to speed up encoding the contents of a CSV file. However, I run into this error:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
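
For reference, the idiom the message refers to looks roughly like the minimal sketch below. The `square` worker and the input list are placeholders for illustration and are not part of my project; the point is only that the worker is defined at module level and the `Pool` is created under the guard.

```python
from multiprocessing import Pool


def square(x):
    # Worker functions live at module level so child processes can import them.
    return x * x


def main():
    # The pool is created only when main() is called from the guard below,
    # never as a side effect of importing this file.
    with Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    print(main())
```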

I did add the `if __name__ == '__main__':` guard:

Search = SemanticSearch(model_path, data_path, query)
if __name__ == '__main__':

    query, flat, top_results = Search.search()

That call ends up in these methods of my class:

    def setup(self):
        with open(self.data_path, newline='') as f:  # read and sort data
            reader = csv.reader(f)
            data1 = list(reader)
        self.corpus = [x for sublist in data1 for x in sublist]  # turn into 1D list
        #SemanticSearch.encode(self)
        self.texts_encodings = self.map(self.encode, self.corpus)
        end = time.time()
        print(end - self.start)


    def encode(self):
        self.start = time.time()
        return self.model.encode(self.corpus, convert_to_tensor=True, show_progress_bar=True)

In my `__init__` method I set:

self.map = Pool().map

Any tips on what I'm missing? Thanks in advance.

EDIT

class SemanticSearch(object):
   def __init__(self, model, data, query):
       self.query = query
       self.model = SentenceTransformer(model)  ### Model location
       self.data_path = data  ###path to csv 
       self.corpus = None
       self.texts_encodings = None
       self.start = None
       self.map = Pool().map

   def setup(self):
       print('here')
       with open(self.data_path, newline='') as f:  # read and sort data
           reader = csv.reader(f)
           data1 = list(reader)
       self.corpus = [x for sublist in data1 for x in sublist]  # turn into 1D list
       # SemanticSearch.encode(self)
       self.texts_encodings = self.map(self.encode, self.corpus)
       # SemanticSearch.encode(self)
       end = time.time()
       print(end - self.start)

   def encode(self):
       self.start = time.time()
       return self.model.encode(self.corpus, convert_to_tensor=True,
                                show_progress_bar=True)  ##encode to invisible layer

   def search(self):
       SemanticSearch.setup(self)


if __name__ == "__main__":
   model_path = r'data\BERT_MODELS\fine-tuned\multi-qa-MiniLM-L6-cos-v1'
   data_path = 'data/raw_data/Jira-2_14_2022.csv'
   query = 'query'

   Search = SemanticSearch(model_path, data_path, query)

   query, flat, top_results = Search.search()

ti7
mat347
  • `self` is sent to the child process via `pickle`. There are various ways pickle sends custom classes, but generally `__init__` should not be called on the other side when it is un-pickled. Despite that, I believe that is what's happening (therefore `self.__init__` is called recursively from each child process unless the `RuntimeError` stops it). If you could provide a more complete example of your code it would be easier to confirm. – Aaron Feb 25 '22 at 17:32
  • @Aaron I provided more info, so I should just have it call a function and instantiate my variables from there to satisfy it? – mat347 Feb 25 '22 at 19:01
  • move the creation of `Search` inside `if __name__ == "__main__":` that is where the multiprocessing happens, and it must be prevented from happening when your file is imported in the child process. The call to `Pool()` itself is what creates child processes, not just when you call `map` – Aaron Feb 25 '22 at 20:39
  • Treat multiprocessing like writing a library where importing the library should only define functions and classes on import. Only when run as the main file should functions be executed, and instances of classes created. – Aaron Feb 25 '22 at 20:43
  • @Aaron I made those changes and partly understand what you mean. However, it now just loops through the init method, never getting past `self.texts_encodings = self.map(self.encode, self.corpus)`. I moved the entirety of the code that calls that class into that `if __name__ == "__main__":` `model_path = r'data' data_path = 'data' query = 'alert physician of specialty drug use' Search = SemanticSearch(model_path, data_path, query) query, flat, top_results = Search.search()` – mat347 Feb 25 '22 at 23:00
  • please edit your question to include this code. Comments can't capture correct formatting, making it hard to follow what exactly is going on... – Aaron Feb 25 '22 at 23:21
  • @Aaron I made those edits you needed. Am I putting this `if __name__ ==` in the right location? – mat347 Feb 28 '22 at 18:15
  • Does this answer your question? [where to put freeze_support() in a Python script?](https://stackoverflow.com/questions/24374288/where-to-put-freeze-support-in-a-python-script) – ti7 Feb 28 '22 at 18:15
  • @mat347 yes, it is the correct location, which should solve the `RuntimeError` yes? The next problem you have may be that an error is raised inside the call to `Pool.map()` which is not being caught and handled properly. This may be worth a separate question. – Aaron Feb 28 '22 at 18:35
  • https://stackoverflow.com/search?q=An+attempt+has+been+made+to+start+a+new+process+before+the+current+process+has+finished+its+bootstrapping+phase. That's actually not totally new. Also, prepare a [mcve]. Not snippets of code where people have to guess how they should be assembled into a working program. – Ulrich Eckhardt Feb 28 '22 at 19:24
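
Pulling the comments together, one possible restructuring is sketched below. It is only a sketch under stated assumptions: `encode_one` is a hypothetical stand-in for the real model call (it returns the text length so the example runs without sentence-transformers), and the corpus is an in-memory list rather than the CSV file. What it illustrates is that the class creates no `Pool` in `__init__`, that the worker is a module-level function taking one corpus item (which is what `map` passes), and that instances are only created under the `__main__` guard.

```python
import time
from multiprocessing import Pool


def encode_one(text):
    # Hypothetical placeholder for the real encoder: returns the text length.
    # It accepts exactly one corpus item, matching what Pool.map supplies.
    return len(text)


class SemanticSearch:
    def __init__(self, corpus):
        # Only plain data is stored here; constructing the object
        # does not start any child processes.
        self.corpus = corpus
        self.texts_encodings = None

    def setup(self, processes=2):
        start = time.time()
        # The pool lives only for the duration of this call and is not
        # kept as an attribute, so the instance stays picklable.
        with Pool(processes) as pool:
            self.texts_encodings = pool.map(encode_one, self.corpus)
        return time.time() - start


if __name__ == "__main__":
    search = SemanticSearch(["alpha", "be", "gam"])
    elapsed = search.setup()
    print(search.texts_encodings, f"{elapsed:.2f}s")
```

With a real model, each worker would also need its own way to obtain the model; how best to do that for `SentenceTransformer` is not covered by this sketch.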
