
I have a small Python project (~5000 lines) that runs certain heavy statistical calculations on medium-sized Postgres database tables (~50 million entries), using the pg library.

The nature of the statistical calculations means the code takes a while to run and I want to profile both the Python and PSQL sides of my code. I tried my go-to Python profiler (cProfile), but it seems to be blind to the PSQL side.

How can I go about profiling both the Python and the PSQL in my system?

john_science
  • what is the expected output of the PSQL side -- wouldn't the PSQL call in Python give you enough information? – johntellsall Aug 25 '14 at 23:16
  • Well, the PSQL queries are, by necessity, quite involved. They are not just `select * from table`. They do quite a lot of statistics and data manipulation. In many cases, I *think* breaking the queries into smaller parts (so each component could be profiled separately) would make them slower, as I would need to create intermediate tables and values. I could be wrong here. – john_science Aug 25 '14 at 23:21
  • Creating intermediate tables and indexes can be surprisingly quick (although certainly subject to a lot of variables). I wouldn't dismiss the split-it-up option out of hand -- often that can make a big difference. If you didn't want to use temp tables, *CTEs* (essentially query-scoped temp tables) may help and be simpler to work with. Have you tried getting the query plan? If the SQL generated is too dynamic, you could have the library dump out the SQL that will be sent to the server and then use that in *pgadmin* or *psql* to look at the query plan. – khampson Aug 26 '14 at 03:51

2 Answers


Basically, you can't.

You can measure performance and examine execution plans for PostgreSQL queries, but I can't even begin to picture how you'd integrate that with a profile of an imperative programming language. You could replace the Python entirely with Perl generating the same queries and it wouldn't make a blind bit of difference to PostgreSQL anyway.

So - there are two interesting questions to answer. Firstly, what queries are taking up all your time? Secondly, is the query planner doing a good job?

For the first, there are a number of logfile processors that will build stats for you, but I would recommend the pg_stat_statements extension. Leave this running and it will count up how many times you run different queries and how long they take.
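In case it helps, here's a minimal sketch of reading those counters from Python (assuming psycopg2 as the driver and a hypothetical connection string; the extension has to be enabled first with `shared_preload_libraries = 'pg_stat_statements'` in postgresql.conf, a server restart, and `CREATE EXTENSION pg_stat_statements` in your database):

```python
# Sketch: list the most expensive statements recorded by pg_stat_statements.
# Column names are those of older pg_stat_statements versions; PostgreSQL 13+
# renames total_time to total_exec_time.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("""
    SELECT query, calls, total_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10
""")
for query, calls, total_time in cur.fetchall():
    print("%8d calls  %12.1f ms  %s" % (calls, total_time, query[:80]))
cur.close()
conn.close()
```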

For the second, you'll want to check the query plans produced via EXPLAIN and see if you need to alter the indexes available, or perhaps the queries themselves. That's too large a topic for here, but the wiki is a good starting point. There's an excellent online tool to help illustrate query-plans too.
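If the SQL is generated dynamically from Python, one way to get at a plan without leaving the code is to prefix a query with EXPLAIN (or EXPLAIN ANALYZE, which actually runs it). A rough sketch, again assuming psycopg2 and with a placeholder query:

```python
# Sketch: print the plan for one of the heavy queries. Plain EXPLAIN only
# plans the query; EXPLAIN ANALYZE executes it, so it takes as long as the
# real thing but reports actual row counts and timings.
import psycopg2

heavy_query = "SELECT count(*) FROM my_table"  # stand-in for one of your statistical queries

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + heavy_query)
for (line,) in cur.fetchall():
    print(line)
```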

Richard Huxton

I was just reading the cProfile doc.

  • It measures CPU time, which won't count I/O or out-of-process time. I assume your SQL queries are dominated by I/O or out-of-process time, so naturally you won't see them. (You can make it measure wall-clock time by providing your own timer function; a sketch of that follows this list.)

  • It reports, in the cumtime column, inclusive time by function but not by line of code, so you don't get any information about the percentage of time that individual lines cost. If it reported percent of wall-clock time by line of code, you could see which queries were costing the most.
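For instance, a minimal sketch of swapping in a wall-clock timer (run_analysis is a stand-in for your real entry point):

```python
# Sketch: profile with wall-clock time so the time spent waiting on
# PostgreSQL is attributed to the calling functions.
import cProfile
import pstats
import time

def run_analysis():
    pass  # stand-in for your real entry point

prof = cProfile.Profile(time.time)  # wall-clock timer instead of the default
prof.enable()
run_analysis()
prof.disable()
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```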

Here's the technique a lot of people use: pause the running program at random a handful of times and look at the call stack each time, i.e. take a few stack samples by hand. It doesn't worry about accuracy of timing. If a particular query accounts for around 50% or 60% of your waiting time, do you really care exactly which percentage it is?

In fact, when you've seen it on as few as two samples, you know it's a big one, and you know precisely what it is.
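One low-tech way to take such samples of a running Python process (a sketch of one possible approach, Unix-only, not something prescribed above) is to install a signal handler that dumps the current stack, then hit Ctrl-\ a few times while the program is busy:

```python
# Sketch: manual stack sampling. Send SIGQUIT (usually Ctrl-\ in a terminal)
# while the program is running and the current Python call stack is printed.
# If most samples land inside the same cursor.execute() call, that query is
# where the time is going.
import signal
import sys
import traceback

def dump_stack(signum, frame):
    traceback.print_stack(frame, file=sys.stderr)

signal.signal(signal.SIGQUIT, dump_stack)
```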

Mike Dunlavey