
I have to search a large table of scientific journal articles for some specific articles I have in a separate file. My approach is to build a search index from the large table using Whoosh and then search the index for each article from the separate file. This works well, but it takes too long (~2 weeks). So I wanted to speed things up a bit by implementing multiprocessing, and that's where I'm struggling.
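
For context, the index is built roughly like this (a simplified sketch; apart from the `full_text` field and `dir_index`, the names here are illustrative):

from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in

# simplified schema -- the real one has more fields
schema = Schema(article_id=ID(stored=True), full_text=TEXT(stored=True))
ix = create_in(dir_index, schema)
writer = ix.writer()
for row in table_rows:  # table_rows: iterable over the large table (illustrative)
   writer.add_document(article_id=row[0], full_text=row[1])
writer.commit()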

The essential part of my "simple" search without multiprocessing looks as follows:

from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

with open('AuthorArticles.txt', 'r', encoding='utf-8') as f:
   articles = f.read().splitlines()

fs = FileStorage(dir_index, supports_mmap=False)
ix = fs.open_index()
with ix.searcher() as srch:
   for article in articles:
      # do stuff with article
      q = QueryParser('full_text', ix.schema).parse(article)
      res = srch.search(q, limit=None)
      if not res.is_empty():
         with open(output_file, 'a', encoding='utf-8') as target:
            for r in res:
               target.write(r['full_text'])

Now, what I specifically want to achieve is that the index is loaded into memory only once and that multiple processes then access it to search for the articles. My attempt so far looks like this:

from multiprocessing import Pool

from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

with open('AuthorArticles.txt', 'r', encoding='utf-8') as f:
   articles = f.read().splitlines()

def search_index(article):
   # each worker process opens its own handle to the on-disk index
   fs = FileStorage(dir_index, supports_mmap=True)
   ix = fs.open_index()
   result = []
   with ix.searcher() as srch:
      # do stuff with article
      q = QueryParser('full_text', ix.schema).parse(article)
      res = srch.search(q, limit=None)
      if not res.is_empty():
         for r in res:
            result.append(r['full_text'])
   return result

if __name__ == '__main__':
   with Pool(4) as p:
      # chunksize only batches how the articles are shipped to the
      # workers; search_index is still called once per article
      results = p.map(search_index, articles, chunksize=100)
      print(results)

But as far as I understand it, this way each process loads the index into memory on its own (which won't work, since the index is fairly large).

Is there any way I can achieve this in a relatively simple way? Basically, all I want to do is search the index using the whole computational power at hand.

– smint75

2 Answers


You could use multiprocessing's shared ctypes to share memory between processes. You can pass lock=False to Value or Array if you only need read access; see the sketch below.

Maybe this answer can help you move further:
How to combine Pool.map with Array (shared memory) in Python multiprocessing?
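
The pattern from that answer, as a minimal sketch (the array contents and worker function are only illustrative; with lock=False you get a raw, unsynchronized ctypes array, which is fine for read-only use):

from multiprocessing import Pool, Array

shared = None

def init_worker(arr):
   # runs once in each worker; stash the shared array in a module-level global
   global shared
   shared = arr

def lookup(i):
   # workers only read, so the lock-free array is safe to use
   return shared[i]

if __name__ == '__main__':
   data = Array('d', [1.0, 2.0, 3.0, 4.0], lock=False)
   with Pool(2, initializer=init_worker, initargs=(data,)) as p:
      print(p.map(lookup, range(4)))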

– matino

If you use mmap (with the access parameter set to ACCESS_READ) to "read" the file, the virtual memory subsystem of your operating system ensures that only one copy of the data is loaded into physical memory.
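
For example (the file name is just a placeholder), mapping a file read-only looks like this:

import mmap

# map an existing file read-only; every process that maps the same
# file shares the pages through the OS page cache
with open('some_index_file', 'rb') as f:
   mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
   print(mm[:16])  # reads come straight from the shared mapping
   mm.close()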

Since you initialize the FileStorage with supports_mmap=True, it will use mmap and your problem should already be solved. :-)

– Roland Smith
  • Great! Thanks a lot for your help, will try that. Does that work for every OS the same? – smint75 Jun 09 '15 at 08:44
  • As you can see in the documentation, the parameters for `mmap` are different between POSIX platforms and ms-windows. If `FileStorage` takes that into account, it should be fine. Check the source code. – Roland Smith Jun 09 '15 at 17:10