Understanding implications of using multiprocessing to manage long-running db queries

Question

We are dealing with an extremely (over several minutes) long-running Django orm keyword text query that is causing timeouts.

For reasons we've yet to diagnose, when a keyword search is made on our postgres db containing over 1 million text records, the timeout occurs despite having implemented what we believe is the recommended approach to improving performance in such as scenario (e.g GIN indexes and to_tsvectors).

There are probably many stones yet unturned, but as a temporary fix, initial Googling points to a solution to manage/stop the hanging query using the multiprocessing module. For example:

import multiprocessing
import time

def run_problem_db_query(textFileSearchParameter, returned_statements):
    # this is the db query that hangs.....
    retrieved_text_files = statements.extra(where=["textfiles_tsv @@ plainto_tsquery(%s)"], params=[textFileSearchParameter])
    return retrieved_text_files

retrieved_text_files = None

p = multiprocessing.Process(target=run_problem_db_query, name="run_problem_db_query", args=(textSearch, retrieved_text_files))
p.start()
time.sleep(10)
p.terminate()
p.join()

Even though a few blogs and SO posts have recommended a version of this approach, what are the potential problems (if any)?

Ultimately, we want to fix the query of course and any potential issues with the db itself, but aside from consuming a potentially-large memory footprint (as mentioned here), what else are the potential pitfalls of such an approach?

score 1 · Answer 1 · answered Oct 02 '18 at 21:28

1

You don't need any multi-processing module, just set statement_timeout to the maximum duration of a query.

And yes, find and tune the query that causes your problem.

answered Oct 02 '18 at 21:28

Laurenz Albe

209,280
17
206
263

Isn’t that something we have to do at the settings level where the db config is defined? Or can you actually specify that for a specific query? – AdjunctProfessorFalcon Oct 02 '18 at 21:32
You can set it at any time using the `SET` SQL statement. – Laurenz Albe Oct 02 '18 at 21:36
@AdjunctProfessorFalcon: It's actually *discouraged* to set it in postgresql.conf. [The manual:](https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT) `Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions.` – Erwin Brandstetter Oct 02 '18 at 21:42
@LaurenzAlbe given the current Django ORM query, would you mind providing an example of how to set that up? – AdjunctProfessorFalcon Oct 02 '18 at 21:44
I think it is ok to set it in `postgresql.conf` as long as you set it to a value that exceeds the maximum expected duration of any query in your application. – Laurenz Albe Oct 02 '18 at 21:45
I don't know anything about django, but hopefully it is able to run a plain SQL query. – Laurenz Albe Oct 02 '18 at 21:47

score 0 · Answer 2 · edited Oct 02 '18 at 17:55

0

While multiple process are running for this query, the number of queries to DB at point will increase which will increase load on DB, any other priority queries from other sources will be affected.

edited Oct 02 '18 at 17:55

Joe Love

5,594
2
20
32

answered Oct 02 '18 at 17:29

Yenni Suresh

1
2

Understanding implications of using multiprocessing to manage long-running db queries

2 Answers2