
The background to my question is the following: I have a search index implemented in whoosh, and I want to get the rankings for a batch of queries. I want to speed this up by running multiple queries concurrently with ThreadPoolExecutor.map.

In whoosh you have a Searcher object, which is (among other things) a wrapper around multiple open files that keeps internal state about which file it is currently reading and where. According to the whoosh documentation you have to use one Searcher per thread, but sharing one Searcher across multiple search requests is a big performance win. So my understanding is that each thread the ThreadPoolExecutor starts should be initialized with its own Searcher on the index, and when that thread is reused for another query it should keep that Searcher instead of creating a new one per query.

My attempt was along these lines:

from whoosh.index import Index
from concurrent.futures import ThreadPoolExecutor
from typing import List

class MyExecutor:
  def __init__(self, whoosh_index: Index):
    # Open a searcher on the index that was passed in
    self.searcher = whoosh_index.searcher()

  def __call__(self, query):
    # Run a single query with the stored searcher
    return self.searcher.search(query)


def query_batched(queries: List[str], whoosh_index: Index, num_threads: int):
  with ThreadPoolExecutor(max_workers=num_threads) as pool:
    return pool.map(MyExecutor(whoosh_index), queries)

But this raises an exception, UnpicklingError: could not find MARK, somewhere deep down in the whoosh code. This suggests to me that whoosh is trying to unpickle from a file whose read position is not at the start of the pickled data. Does this mean that the code still uses only one Searcher across multiple threads, which is not thread-safe?

How can I fix the code so that each thread has its own Searcher object but reuses it every time it is called?

Josef
  • Reading the documentation, I think it's saying to use one searcher per thread, and that within one thread the searcher should be used multiple times to get the caching performance benefits. – pjcunningham Feb 06 '23 at 10:00
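One way to get "one searcher per thread, reused across calls" is to keep the searcher in a threading.local and create it lazily on the first call made from each worker thread. The following is only a minimal sketch of that pattern, not tested against whoosh: PerThreadSearch, make_searcher and FakeSearcher are hypothetical names introduced for illustration, and FakeSearcher is a stand-in so the example runs without an index. With whoosh, the factory would presumably be the bound method whoosh_index.searcher, and the queries would need to be parsed Query objects rather than raw strings.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


class PerThreadSearch:
    """Callable that lazily creates one searcher per worker thread and reuses it."""

    def __init__(self, make_searcher: Callable[[], Any]):
        # make_searcher is a hypothetical zero-argument factory. With whoosh it
        # could be whoosh_index.searcher (an assumption, not verified here).
        self._make_searcher = make_searcher
        self._local = threading.local()

    def __call__(self, query: Any) -> Any:
        # Each thread sees its own attributes on self._local.
        searcher = getattr(self._local, "searcher", None)
        if searcher is None:
            searcher = self._make_searcher()
            self._local.searcher = searcher
        return searcher.search(query)


def query_batched(queries: Iterable[Any],
                  make_searcher: Callable[[], Any],
                  num_threads: int) -> List[Any]:
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Materialize the results while the pool is still open.
        return list(pool.map(PerThreadSearch(make_searcher), queries))


class FakeSearcher:
    """Stand-in for a real searcher; remembers the thread that created it."""

    def __init__(self):
        self.owner = threading.get_ident()

    def search(self, query):
        # Fails if a searcher is ever used from a thread other than its creator.
        assert threading.get_ident() == self.owner
        return (query, self.owner)
```

Note that this sketch never closes the per-thread searchers. With real searchers that hold open files, they would need to be closed once the batch is done, and any result objects that depend on an open searcher would have to be consumed before that.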
