The background to my question is the following: I have a search index implemented in whoosh, and I want to get the rankings for a batch of queries. I want to speed this up by handling multiple queries at a time, using ThreadPoolExecutor.map.
In whoosh you have a Searcher object, which is (among other things) a wrapper that handles multiple open files and keeps internal state about which file it currently points at. According to the whoosh documentation you have to use one Searcher per thread in your code, but if you can share one across multiple search requests, it's a big performance win. So my understanding is that every time the ThreadPoolExecutor starts a new thread, that thread should be initialized with its own Searcher object for the index, and when the thread is reused for another query it should keep that object instead of creating a new one per query.
My attempt was along these lines:
    from concurrent.futures import ThreadPoolExecutor
    from typing import List

    from whoosh.index import Index


    class MyExecutor:
        def __init__(self, whoosh_index: Index):
            # Runs once, when the instance is created
            self.searcher = whoosh_index.searcher()

        def __call__(self, query):
            return self.searcher.search(query)


    def query_batched(queries: List[str], whoosh_index: Index, num_threads: int):
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return pool.map(MyExecutor(whoosh_index), queries)
But this raises the exception UnpicklingError: could not find MARK somewhere deep down in the whoosh code. This suggests to me that the error occurs because whoosh attempts to read from a file whose position is not at the beginning of the file. Does this mean that the code still uses only one Searcher across multiple threads, which is not thread-safe?
How can I fix the code so that each thread has its own Searcher object but reuses it every time it is called?
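For reference, here is a minimal sketch of the per-thread pattern I am after. In my attempt, MyExecutor is constructed only once, so every worker thread ends up calling the same self.searcher. The sketch instead stores the searcher in threading.local, so each worker thread lazily creates its own on first use and reuses it afterwards. PerThreadSearch, make_searcher and DummySearcher are hypothetical names made up for illustration; DummySearcher merely stands in for a real whoosh Searcher so the snippet runs without an index. With whoosh, I assume one would pass whoosh_index.searcher as make_searcher. Closing the searchers is left out here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class PerThreadSearch:
    """Callable that creates one searcher per worker thread and reuses it."""

    def __init__(self, make_searcher: Callable[[], object]):
        # make_searcher is a hypothetical factory, e.g. whoosh_index.searcher
        self._make_searcher = make_searcher
        self._local = threading.local()  # attributes are separate per thread

    def __call__(self, query):
        searcher = getattr(self._local, "searcher", None)
        if searcher is None:
            # First call in this thread: create and remember a searcher
            searcher = self._make_searcher()
            self._local.searcher = searcher
        return searcher.search(query)


def query_batched(queries: List[str], make_searcher, num_threads: int):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Materialize the results before the pool shuts down
        return list(pool.map(PerThreadSearch(make_searcher), queries))


class DummySearcher:
    """Stand-in for a real searcher (assumption, not part of whoosh)."""

    created = 0
    _lock = threading.Lock()

    def __init__(self):
        with DummySearcher._lock:
            DummySearcher.created += 1
        self.owner = threading.get_ident()

    def search(self, query):
        # Fails if a searcher is ever used from a thread that did not create it
        assert self.owner == threading.get_ident()
        return (query, self.owner)


results = query_batched([f"q{i}" for i in range(20)], DummySearcher, 3)
print(len(results), "results from", DummySearcher.created, "searcher(s)")
```

The number of searchers created is at most num_threads, regardless of how many queries are submitted. The initializer argument of ThreadPoolExecutor might be another place to set up the thread-local state, but I have not tried that.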