0

We've just built a system that rolls up its data at midnight. It must iterate through several combinations of tables in order to rollup the data it needs. Unfortunately the UPDATE queries are taking forever. We have 1/1000th of our forecasted userbase and it already takes 28 minutes to rollup our data daily with just our beta users.

Since the main lag is UPDATE queries, it may be hard to delegate servers to handle the data processing. What are some other options for optimizing millions of UPDATE queries? Is my scaling issue in the code below?:

        $sql = "SELECT ab_id, persistence, count(*) as no_x FROM $query_table ftbl
                WHERE ftbl.$query_col > '$date_before' AND ftbl.$query_col <= '$date_end'
                GROUP BY ab_id, persistence";

        $data_list = DatabaseManager::getResults($sql);

        if (isset($data_list)){
            foreach($data_list as $data){

                $ab_id = $data['ab_id'];
                $no_x = $data['no_x'];
                $measure = $data['persistence'];

                $sql = "SELECT ab_id FROM $rollup_table WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
                if (DatabaseManager::getVar($sql)){
                        $sql = "UPDATE $rollup_table SET $rollup_col = $no_x WHERE ab_id = $ab_id AND rollup_key = '$measure' AND rollup_date = '$day_date'";
                                DatabaseManager::update($sql);
                } else {
                        $sql = "INSERT INTO $rollup_table (ab_id, rollup_key, $rollup_col, rollup_date) VALUES ($ab_id, '$measure', $no_x, '$day_date')";
                                DatabaseManager::insert($sql);
                }
            }
        }
Kyle Cureau
  • 19,028
  • 23
  • 75
  • 104
  • Have you thought about [replication](http://dev.mysql.com/doc/refman/5.0/en/replication.html)? – Jared Farrish Oct 06 '11 at 02:39
  • How long do your SELECT queries take to process without the UPDATE or INSERT statements? Is it possible that there may be larger bottlenecks with those that may resolve your performance issues? Is it necessary to iterate through your data at the PHP source-code level rather than the MySQL stored procedure level? Is there a necessity to iterate procedurally through your records, or is it possible to accomplish the same thing through set-based operations? – Shan Plourde Oct 06 '11 at 02:45
  • To answer @Shan's question: the SELECT takes a fraction of the time. Do you have an example of how this is done at the "MySQL procedure level" or an example of "set-based" operations? I'm right now looking inot Joey's link – Kyle Cureau Oct 06 '11 at 02:55
  • @Jared, wouldn't replication only help if the SELECT queries were taking a while. Since this has to do with write speed, it would seem I could only have one master at a time anyway – Kyle Cureau Oct 06 '11 at 02:56
  • @JoeyRivera, thanks! that's helps a lot. Maybe it should be an answer and I'll accept it? Unless this is a duplicate, then I'd close it. – Kyle Cureau Oct 06 '11 at 03:10
  • 1
    MySQL stored procedures: http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx. What is the percentage breakdown of your queries - by both SQL selects, and the INSERT and UPDATE? It will be easier to help identify solutions with that info. Regarding set-based versus procedural, there are lots of useful resources online that you can read up about this. Here's a SO link to get started: http://stackoverflow.com/questions/24168/why-are-relational-set-based-queries-better-than-cursors. – Shan Plourde Oct 06 '11 at 03:13
  • Thanks @Shan! Looking into your links. It's almost a 50/50 split. I use one major `SELECT` query then do minor ones to see if I should do an `UPDATE` or `INSERT`...using `microtime()` though shows that the lag is almost entirely in the UPDATE part of my code. – Kyle Cureau Oct 06 '11 at 03:33
  • Does your rollup_table have an index that covers the columns from the UPDATE query? Also, the SELECT query ran right before your UPDATE query appears to have an identical WHERE clause. It seems to be a redundancy. If you can get away with only running the WHERE clause once, you will shave a lot of time off your largest bottleneck. – Shan Plourde Oct 06 '11 at 03:40
  • Another thing to consider is what database engine you are using. If you are working with a table that requires a lot of inserts/updates, InnoDB will perform better. MyISAM is good engine for a table that requires lots of selects. And remember you can always mix and match. – Joey Rivera Oct 06 '11 at 13:29

1 Answers1

2

When addressing SQL scaling issues, it is always best to benchmark your problematic SQL. Even at the PHP level is fine in this case, as you're running your queries within PHP.

If your first query could potentially return millions of records, you may be better served running that query as a MySQL stored procedure. That will minimize the amount of data that has to be transferred between database server and PHP application server. Even if both are the same machine, you can still realize a significant performance improvement.

Some questions to consider that may help to resolve your issue follow:

  • How long do your SELECT queries take to process without the UPDATE or INSERT statements?
  • What is the percentage breakdown of your queries - by both SQL selects, and the INSERT and UPDATE? It will be easier to help identify solutions with that info.
  • Is it possible that there may be larger bottlenecks with those that may resolve your performance issues?
  • Is it necessary to iterate through your data at the PHP source-code level rather than the MySQL stored procedure level?
  • Is there a necessity to iterate procedurally through your records, or is it possible to accomplish the same thing through set-based operations?
  • Does your rollup_table have an index that covers the columns from the UPDATE query?
  • Also, the SELECT query ran right before your UPDATE query appears to have an identical WHERE clause. It seems to be a redundancy. If you can get away with only running the WHERE clause once, you will shave a lot of time off your largest bottleneck.

If you're unfamiliar with writing MySQL stored procedures, the process is quite simple. See http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx for an example. MySQL has good documentation on this as well. A stored procedure is a program that runs within the MySQL database process, which may help to improve performance when dealing with queries that potentially return millions of rows.

Set-based database operations are often faster than procedural operations. SQL is a set-based language. You can update all rows in a database table with a single UPDATE statement, i.e. UPDATE customers SET total_owing_to_us = 1000000 updates all rows in the customers table, without the need to create a programmatic loop like you've created in your sample code. If you have 100,000,000 customer entries, the set-based update will be significantly faster than the procedural update. There are lots of useful resources online that you can read up about this. Here's a SO link to get started: Why are relational set-based queries better than cursors?.

Community
  • 1
  • 1
Shan Plourde
  • 8,528
  • 2
  • 29
  • 42