
This query:

EXPLAIN SELECT ppi_loan.customerID,
               loan_number,
               CONCAT(forename, ' ', surname) AS agent,
               name,
               broker,
              (SELECT timestamp
               FROM ppi_sar_status
               WHERE history = 0
                   AND (status = 10 || status = 13)
                   AND ppi_sar_status.loanID = ppi_loan.loanID) AS ppi_unsure_date,
              fosSent,
              letterSent,
              (SELECT timestamp
               FROM ppi_ques_status
               WHERE status = 1 
                   AND ppi_ques_status.loanID = ppi_loan.loanID
               ORDER BY timestamp DESC LIMIT 1) AS sent_date,
               ppi_ques_status.timestamp
FROM ppi_loan
LEFT JOIN ppi_assignments ON ppi_assignments.customerID = ppi_loan.customerID
LEFT JOIN italk.users ON italk.users.id = agentID
LEFT JOIN ppi_ques_status ON ppi_ques_status.loanID = ppi_loan.loanID
JOIN ppi_lenders ON ppi_lenders.id = ppi_loan.lender
JOIN ppi_status ON ppi_status.customerID = ppi_loan.customerID
JOIN ppi_statuses ON ppi_statuses.status = ppi_status.status
   AND ppi_ques_status.status = 1
   AND ppi_ques_status.history = 0
   AND (cc_type = '' || (cc_type != '' AND cc_accepted = 'no'))
   AND ppi_loan.deleted = 'no'
   AND ppi_loan.customerID != 10
GROUP BY ppi_loan.customerID, loan_number

is very slow. Here are the full results of the EXPLAIN query:

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   PRIMARY ppi_ques_status ref loanID,status,history   status  3   const   91086   Using where; Using temporary; Using filesort
1   PRIMARY ppi_loan    eq_ref  PRIMARY,customerID  PRIMARY 8   ppimm.ppi_ques_status.loanID    1   Using where
1   PRIMARY ppi_lenders eq_ref  PRIMARY PRIMARY 4   ppimm.ppi_loan.lender   1   Using where
1   PRIMARY ppi_assignments eq_ref  customerID  customerID  8   ppimm.ppi_loan.customerID   1   
1   PRIMARY users   eq_ref  PRIMARY PRIMARY 8   ppimm.ppi_assignments.agentID   1   
1   PRIMARY ppi_status  ref status,customerID   customerID  8   ppimm.ppi_loan.customerID   6   
1   PRIMARY ppi_statuses    eq_ref  PRIMARY PRIMARY 4   ppimm.ppi_status.status 1   Using where; Using index
3   DEPENDENT SUBQUERY  ppi_ques_status ref loanID,status   loanID  8   func    1   Using where; Using filesort
2   DEPENDENT SUBQUERY  ppi_sar_status  ref loanID,status,history   loanID  8   func    2   Using where

Why is it scanning so many rows and why "Using temporary; Using filesort"? I can't remove any subqueries as I need all of the results that they produce

swdee
  • It is apparent that you have single field indexes only in the `ppi_ques_status` table, whereas you would need a multi-field index to really help your query. – Shadow Feb 16 '17 at 14:55
  • It scans that many rows because that many rows satisfy the condition you gave your database. `using where` means it's applying an index to reduce the number of rows it has to look up, and `using filesort` is there because it's sorting records bit by bit to return them in the order you specified (filesort is just badly named). It's slow because it reads from disk (slow) instead of memory (fast). – Mjh Feb 16 '17 at 14:58
  • @mjh your interpretation of the explain is way off the actual meaning. – Shadow Feb 16 '17 at 15:04
  • Can you pls add which table those fields in the `from` clause come from for which you did not indicate the table name, such as `loan_number`? – Shadow Feb 16 '17 at 15:12
  • These are from ppi_loan: loan_number, cc_type, cc_accepted, fosSent, letterSent. "name" is from ppi_lenders. forename and surname from italk.users – swdee Feb 16 '17 at 15:18
  • The results this produces are only 755 – swdee Feb 16 '17 at 15:20
  • @swdee the number of results this query produces has little to do with the number of records scanned, since you have a group by. Are you trying to determine if you contacted your PPI customers as required by the PRA? – Shadow Feb 16 '17 at 15:27
  • Do you know why it's scanning over 90k rows but only producing 755? Also why are you asking about our business? – swdee Feb 16 '17 at 15:43
  • @Shadow - no, you think it's off, it's actually correct. Seeing you haven't proven I'm wrong, I can simply do the same - type that you're wrong. – Mjh Feb 16 '17 at 15:52
  • @mjh pls see the MySQL manual on EXPLAIN output: https://dev.mysql.com/doc/refman/5.7/en/explain-output.html. For example, you wrote that `It scans that many rows because that many rows satisfy the condition you gave your database.` In reality, it means the following: `The rows column indicates the number of rows MySQL believes it must examine to execute the query. For InnoDB tables, this number is an estimate, and may not always be exact.` You can check all the others; you got all of them wrong. – Shadow Feb 16 '17 at 16:20
  • @Shadow - I didn't get them wrong, it appears you've got no idea what `using where` and `using filesort` is nor what it means to inspect rows. Internet is at your disposal, as is the source code, feel free to check it. – Mjh Feb 16 '17 at 16:27
  • @mjh I provided evidence as you asked that you wrote one thing, while the official MySQL documentation says something else. If you believe that I'm wrong, then provide actual evidence to the contrary. – Shadow Feb 16 '17 at 18:20

1 Answer


As already mentioned in a comment, the main cause of the slow query is that you appear to have only single-column indexes, whereas you would need multi-column indexes to cover the joins, the filters, and the group by.

Also, your query has 2 other issues:

  1. Even though you group by only 2 fields, several other fields appear in the select list without being subject to an aggregate function, such as min(). MySQL does allow such queries to run under certain sql_mode settings, but they are still against the SQL standard and may have unexpected side effects, unless you really know what you are doing.

  2. You have filters on the ppi_loan table, the left table of a left join, placed in a join condition. Due to the nature of the left join, these records will not be eliminated from the result set; MySQL simply will not join any values to them. These criteria should be moved to the where clause.
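
Applied to this query, point 2 means relocating the filter predicates out of the join conditions. A sketch of the rewritten FROM/WHERE part (select list unchanged):

```sql
-- Sketch: same joins as before, but the row filters moved into WHERE.
-- Note: filtering ppi_ques_status in WHERE turns its LEFT JOIN into an
-- effective inner join; if loans without a matching ppi_ques_status row
-- must be kept, put those two conditions in that LEFT JOIN's ON clause.
FROM ppi_loan
LEFT JOIN ppi_assignments ON ppi_assignments.customerID = ppi_loan.customerID
LEFT JOIN italk.users     ON italk.users.id = agentID
LEFT JOIN ppi_ques_status ON ppi_ques_status.loanID = ppi_loan.loanID
JOIN ppi_lenders  ON ppi_lenders.id = ppi_loan.lender
JOIN ppi_status   ON ppi_status.customerID = ppi_loan.customerID
JOIN ppi_statuses ON ppi_statuses.status = ppi_status.status
WHERE ppi_ques_status.status = 1
  AND ppi_ques_status.history = 0
  AND (cc_type = '' OR (cc_type != '' AND cc_accepted = 'no'))
  AND ppi_loan.deleted = 'no'
  AND ppi_loan.customerID != 10
GROUP BY ppi_loan.customerID, loan_number
```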

The indexes I would create:

  • ppi_sar_status: multi-column index on the loanID, status, history fields. I would also consider moving this subquery into the join section, since this table is not joined at the moment.

  • ppi_ques_status: multi-column index on the loanID, status, timestamp fields - this would support both the subquery and the join. Remember, the subquery also has a filesort in the explain.

  • ppi_loan: as a minimum, a multi-column index on the customerID, loan_number fields to support the group by clause and thereby avoid the filesort. You may consider adding the other fields used in the join criteria to this index, based on their selectivity.
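
In MySQL DDL, the three indexes above could be created like this (the idx_* names are purely illustrative):

```sql
-- Illustrative index names; pick whatever naming convention you use.
ALTER TABLE ppi_sar_status  ADD INDEX idx_sar_loan_status_hist (loanID, status, history);
ALTER TABLE ppi_ques_status ADD INDEX idx_ques_loan_status_ts  (loanID, status, timestamp);
ALTER TABLE ppi_loan        ADD INDEX idx_loan_cust_loanno     (customerID, loan_number);
```

Re-run EXPLAIN afterwards; the optimizer will only pick the new indexes up if it judges them selective enough for this query.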

I'm also not sure why you have the last 2 status tables in the join, since you are not retrieving any values from them. If you are using these tables to eliminate certain records, then consider using an exists() subquery instead of a join. In a join, MySQL needs to fetch data from all joined tables, whereas in an exists() subquery it only checks whether at least 1 record exists in the result set, without retrieving any actual data from the underlying tables.
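
A sketch of that exists() variant, assuming ppi_status and ppi_statuses really are only used for filtering and contribute no columns to the select list:

```sql
-- Sketch: the two status joins replaced with a semi-join test.
SELECT ...          -- same select list as before
FROM ppi_loan
LEFT JOIN ppi_assignments ON ppi_assignments.customerID = ppi_loan.customerID
-- other joins as before, minus ppi_status / ppi_statuses
WHERE EXISTS (SELECT 1
              FROM ppi_status
              JOIN ppi_statuses ON ppi_statuses.status = ppi_status.status
              WHERE ppi_status.customerID = ppi_loan.customerID)
```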

Shadow
  • Queries are slow because they're I/O / CPU bound or when there's a lot of data to be returned, which has to be buffered and then sent through the network. Your answer deals only with indexes. That doesn't do anything for sheer *amount* of data to be sent, only more efficient way of reducing how many I/Os will be wasted to find the records. 90k rows or even 1m rows is nothing for today's CPUs, which clearly indicates that default innodb config is used. Even though indexes could be made more efficient, you focused only on one part of the problem. Also, subquery === JOIN. I won't vote on this. – Mjh Feb 16 '17 at 16:31
  • Thank you Shadow, that's very helpful I'll look at that tomorrow – swdee Feb 16 '17 at 17:16
  • @mjh as the OP indicated, the query produces 755 records, so buffering and sending through the network is not an issue. A subquery is not the same as a join. – Shadow Feb 16 '17 at 18:13
  • Subquery **is** the exact same thing as join. It will be parsed and executed exactly the same as if you were to write a join. In this context, it's irrelevant because answer provided seems sufficient, the rest is measuring e-peen. The way I see it, if extras were included in the answer (configuring MySQL) then this answer would be complete. – N.B. Feb 16 '17 at 19:31
  • @N.B. disagree, subqueries are different from joins. They **may** have the same execution plan as a join, or they may not. You can't really tell. If you use an exists() subquery in place of a join, then most likely they will have a different plan - at least in MySQL. Discussing MySQL configuration is beyond optimising a single query. The number of records scanned and the actual resultset are so small, that MySQL configuration is unlikely to have a major impact on the overall query performance. Whereas removing using temporary, using filesort would have a huge impact on the query performance. – Shadow Feb 16 '17 at 19:42
  • @N.B. see the following SO topic on subqueries vs joins if you do not believe me: http://stackoverflow.com/questions/2577174/join-vs-sub-query – Shadow Feb 16 '17 at 19:47
  • @Shadow the reason I don't believe you or the answer you linked is quite simple - every subquery *can* be rewritten as a JOIN and vice versa. They're interchangeable. Their execution plan **can** be the same. What this discussion boils down to is whether you prefer one over the other. You are smart enough to conclude that MySQL (or any other SQL) developers are faced with same problems as everyone - performance. If you believe that there exists a different execution plan for something that. when translated, equals to join - excellent. I'd be glad if I were wrong. Measurement proves otherwise. – N.B. Feb 16 '17 at 20:09
  • @N.B. I'm sorry, but as the linked topic suggests, there are quite a few people who disagree with you. There are quite a few topics even here on SO where the solution was to replace a join with a subquery or vice versa to speed up a query. And no, not all subqueries can be rewritten as a join. For example, when a subquery contains aggregates, you have to use a subquery to create a derived table and use a join on that. But you can't rewrite it without using a subquery. – Shadow Feb 16 '17 at 20:39
  • And if still in doubt, then see what MySQL itself says on subquery vs join performance https://dev.mysql.com/doc/refman/5.7/en/rewriting-subqueries.html : " Also, on some occasions, it is not only possible to rewrite a query without a subquery, **but it can be more efficient to make use of some of these techniques rather than to use subqueries**." This again proves, that it does matter whether you use subuery or a join, therefore the 2 are not the same. – Shadow Feb 16 '17 at 20:49
  • Well, if you're going with assumption that majority is correct, then just remember that majority thought this planet wasn't round. Documentation != source code. But, to make this short - I'll just agree you're correct and please do continue believing you are right. Because, it's totally logical that relational algebra says one thing and that developers tend to use the shortest and most efficient approach, yet you and majority tend to believe that data structures are utilized differently due to slight alteration in syntactic sugar. Have a nice day :) – N.B. Feb 16 '17 at 20:58
  • @N.B. I'm sorry, but just because you are on the minority side in this argument does not make you automatically right. Relational algebra is one thing, actual execution plan is another. Not to mention the fact that there are certain subqueries that cannot be re-written as joins without subqueries. – Shadow Feb 16 '17 at 21:06
  • I want to say thank you @Shadow, that has helped. I removed the 2 table joins we weren't using (that was legacy code), moved the AND conditions from the last join into the WHERE clause, and removed the group by as I got the same results without it, and it's much faster! I didn't want to start changing the indexes though. – swdee Feb 22 '17 at 09:26