I have to search a large table of scientific journal articles for some specific articles I have in a separate file. My approach is to build a search index from the large table using Whoosh, and then search the index for each article from the separate file. This works well, but takes far too long (~2 weeks). So I wanted to speed things up by adding multiprocessing, and that's where I'm struggling.
The essential part of my "simple" search without multiprocessing looks as follows:
from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

articles = open('AuthorArticles.txt', 'r', encoding='utf-8').read().splitlines()

fs = FileStorage(dir_index, supports_mmap=False)
ix = fs.open_index()

with ix.searcher() as srch:
    for article in articles:
        # do stuff with article
        q = QueryParser('full_text', ix.schema).parse(article)
        res = srch.search(q, limit=None)
        if not res.is_empty():
            with open(output_file, 'a', encoding='utf-8') as target:
                for r in res:
                    target.write(r['full_text'])
Now, what I specifically want to achieve is that the index is loaded into memory once and then multiple processes access it to search for the articles. My attempt so far looks like this:
from multiprocessing import Pool

from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

articles = open('AuthorArticles.txt', 'r', encoding='utf-8').read().splitlines()

def search_index(article):
    # every call opens the index on its own
    fs = FileStorage(dir_index, supports_mmap=True)
    ix = fs.open_index()
    with ix.searcher() as srch:
        result = []
        # do stuff with article
        q = QueryParser('full_text', ix.schema).parse(article)
        res = srch.search(q, limit=None)
        if not res.is_empty():
            for r in res:
                result.append(r['full_text'])
        return result

if __name__ == '__main__':
    with Pool(4) as p:
        results = p.map(search_index, articles, chunksize=100)
        print(results)
But, as far as I understand it, this way every single process loads the index into memory separately (which won't work, since the index is fairly large).
Is there any way I can achieve this in a relatively simple way? Basically, all I want to do is search the index using all the computational power at hand.
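One idea I can think of would be to open the index only once per worker process instead of once per call, using the initializer argument of Pool. Below is a minimal sketch of that idea (the names init_worker and search_one and the dir_index placeholder are just illustrative); I am not sure whether this actually keeps the index from being loaded into memory four times, which is part of why I'm asking:

from multiprocessing import Pool

from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

dir_index = 'dir_index'  # placeholder: the actual index directory used above

# per-process globals, filled in by the pool initializer
_searcher = None
_parser = None

def init_worker():
    # runs once in every worker process: open the index and keep a searcher around
    global _searcher, _parser
    fs = FileStorage(dir_index, supports_mmap=True)
    ix = fs.open_index()
    _searcher = ix.searcher()
    _parser = QueryParser('full_text', ix.schema)

def search_one(article):
    # search for a single article using the per-process searcher
    q = _parser.parse(article)
    res = _searcher.search(q, limit=None)
    return [r['full_text'] for r in res]

if __name__ == '__main__':
    articles = open('AuthorArticles.txt', 'r', encoding='utf-8').read().splitlines()
    with Pool(4, initializer=init_worker) as p:
        results = p.map(search_one, articles, chunksize=100)

My (possibly wrong) hope is that with supports_mmap=True the index files are memory-mapped, so the operating system could share those pages between the four workers rather than keeping four separate copies, but I don't know whether Whoosh actually behaves that way.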