
I have a table with about 13 million rows. Each row represents a measurement of a certain type for a particular project for a particular hour of a particular day.

I have a query that finds sums or averages of these values depending on the measurement type. It's very slow, taking a few minutes to run.

We have certain reporting pages that utilize the results of this query, but it's not acceptable for the pages to take multiple minutes to load. My solution so far has been to cache the results of the query in what I'm calling a summary table.

The problem is that the nightly-run script that refreshes the summary table takes too long to run. I'm not even trying to refresh the whole summary table at once but it still takes too long. (By "too long" I mean errors are raised and the refresh job is not completed.)

I have a hunch that the challenges I'm facing are a result of going about things in the wrong way, and that the solution is maybe not to tweak a few things to shave off the query run time by 1% here and there but to approach the problem in a completely different way.

Any suggestions would be appreciated. I apologize if I'm not asking this question in a very good way; I'm not sure how to formulate it better. Happy to provide clarification or more details.

Here's a simplified version of the query that takes forever to run. (Even this simplified version takes quite a long time.)

  SELECT DATE(calc_dt),
         project_id,
         calculation_type_cd,
         SUM(result)
    FROM calc_calculation_results
GROUP BY DATE(calc_dt),
         project_id,
         calculation_type_cd

The nightly job is basically a SELECT INTO that takes the results of this query and puts them into my summary table. The result column is the value we're interested in for the purpose of the reports.
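Roughly, the nightly job looks like the following (the summary table name and columns here are illustrative, not the real schema; note that in MySQL a "SELECT INTO a table" is actually written as `INSERT ... SELECT`):

```sql
-- Simplified sketch of the nightly refresh; calc_summary and its
-- columns are placeholder names, not the real schema.
INSERT INTO calc_summary (calc_date, project_id, calculation_type_cd, sum_result)
    SELECT DATE(calc_dt),
           project_id,
           calculation_type_cd,
           SUM(result)
      FROM calc_calculation_results
  GROUP BY DATE(calc_dt),
           project_id,
           calculation_type_cd;
```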

Jason Swett
  • Are you utilizing any indices? What errors are being raised? Are you saying the query just dies at some point? – Tim Biegeleisen Dec 14 '16 at 01:42
  • Really, Jason, with a 14k rep you really ought to know that a question as nebulous as this is just not answerable. – RiggsFolly Dec 14 '16 at 01:44
  • @TimBiegeleisen I'm getting [this error](http://stackoverflow.com/questions/5836623/getting-lock-wait-timeout-exceeded-try-restarting-transaction-even-though-im) which I happened to ask another question about five years ago. A `SHOW INDEX FROM` on my table does reveal a number of indices but I don't know how to tell what's relevant. – Jason Swett Dec 14 '16 at 01:45
  • @RiggsFolly Suggestions for a better way to ask a question about my challenge are welcome. I'm out of my depth with this issue and my knowledge level is so low that I don't even know where to start. – Jason Swett Dec 14 '16 at 01:48
  • Well, adding some schema information might help. Also some code from the big job that takes too long to run. – RiggsFolly Dec 14 '16 at 01:50
  • @RiggsFolly Good idea. I added the query and some information about the job. (The end of the day when most of my mental energy has been used up is also probably not the best time for me to try to post questions about particularly tough issues...I'm getting off the computer soon.) – Jason Swett Dec 14 '16 at 01:55
  • What is the type of the `calc_dt` column? I'm not sure you need to wrap it with `date()`. – Tim Biegeleisen Dec 14 '16 at 01:55
  • Again, _a simplified version_? What use is it if we fix a query you are not actually using? – RiggsFolly Dec 14 '16 at 01:57
  • @TimBiegeleisen `calc_dt` is of type `datetime`. The reason I'm wrapping it with `date()` is because the calculations are recorded on an hourly basis but I need each row in the resultset to represent a day. – Jason Swett Dec 14 '16 at 01:58
  • @RiggsFolly The real version isn't meaningfully more complex than the simplified version, there's just more noise in it. – Jason Swett Dec 14 '16 at 01:59
  • `SHOW CREATE TABLE calc_calculation_results;`, or at least tell us whether there is a *single* index over the columns (project_id, calculation_type_cd) *in that specific order*, with project_id as the leftmost column in the index... and if there is, then change the grouping to `GROUP BY project_id, calculation_type_cd, DATE(calc_dt)` and describe the performance difference. – Michael - sqlbot Dec 14 '16 at 23:53
  • If you are rescanning 13M rows every night, _that_ is the main problem, not the schema, not the indexes, not the datatypes. (See my answer.) – Rick James Dec 15 '16 at 04:33

1 Answer


Summary table(s) -- good. Rebuilding them -- bad. Instead, augment them incrementally every night.

With a Summary table, the main table needs very few indexes, thereby making it more efficient to load.

The Summary tables have whatever indexes are appropriate for the queries.

More discussion of Summary Tables

Your simplified version might become:

INSERT INTO Summary ( date, project_id, type_cd, sum_result )
    SELECT CURDATE() - INTERVAL 1 DAY,
           project_id,
           calculation_type_cd,
           SUM(result)
      FROM calc_calculation_results
     WHERE calc_dt >= CURDATE() - INTERVAL 1 DAY
       AND calc_dt  < CURDATE()
  GROUP BY project_id,
           calculation_type_cd

It might have:

PRIMARY KEY(date, project_id, type_cd),
INDEX(project_id, date),
INDEX(type_cd, date)
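
With that `PRIMARY KEY`, re-running the nightly job after a partial failure would raise duplicate-key errors. One way to make the job safely re-runnable is MySQL's `ON DUPLICATE KEY UPDATE`; a sketch, using the same assumed column names as above:

```sql
-- Idempotent variant of the nightly insert: rows that already exist
-- for (date, project_id, type_cd) are overwritten instead of erroring.
INSERT INTO Summary ( date, project_id, type_cd, sum_result )
    SELECT CURDATE() - INTERVAL 1 DAY,
           project_id,
           calculation_type_cd,
           SUM(result)
      FROM calc_calculation_results
     WHERE calc_dt >= CURDATE() - INTERVAL 1 DAY
       AND calc_dt  < CURDATE()
  GROUP BY project_id,
           calculation_type_cd
ON DUPLICATE KEY UPDATE sum_result = VALUES(sum_result);
```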
Rick James