-1

I am trying to improve the performance of a query using a "materialized view" to optimize away joins. The first query below is the original, which employs joins. The second is the query written against a table i generated which includes all the joined data (the equivalent of a materialized view). They both return the same result set. Unfortunalatey, somehow, the second query is MUCH slower when handling a very long set of input ids (the IN clause). I don't understand how that could be!!!! Executing all the joins has to have a fair amount of overheat that is saved by the "materialized view", right?

SELECT 
        clinical_sample.INTERNAL_ID AS "internalId",
        sample.STABLE_ID AS "sampleId",
        patient.STABLE_ID AS "patientId",
        clinical_sample.ATTR_ID AS "attrId",
        cancer_study.CANCER_STUDY_IDENTIFIER AS "studyId", 
        clinical_sample.ATTR_VALUE AS "attrValue"
         
        FROM clinical_sample
        INNER JOIN sample ON clinical_sample.INTERNAL_ID = sample.INTERNAL_ID
        INNER JOIN patient ON sample.PATIENT_ID = patient.INTERNAL_ID
        INNER JOIN cancer_study ON patient.CANCER_STUDY_ID = 
                              cancer_study.CANCER_STUDY_ID
         WHERE cancer_study.CANCER_STUDY_IDENTIFIER = 'xxxxx' 
         AND  sample.STABLE_ID IN
                     ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
                      'P-0068363-T01-IM7' , 'P-0068364-T01-IM7'  )
         AND clinical_sample.ATTR_ID IN
                (
                    'CANCER_TYPE'
                );
SELECT   
        internalId,
        sampleId,
      patientId,
       attrId,
        studyId, 
        attrValue
         
        FROM test
         WHERE 
         sampleId IN ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
                      'P-0068363-T01-IM7' , 'P-0068364-T01-IM7'  )
         AND studyId = 'xxxxx' 
         AND attrId = 'CANCER_TYPE';

Update: I did notice in Workbench report that the query with joins seems to scan far fewer rows. About 829k vs ~2400k for the second, joinless query. So having joins seems to actually be a major optimization somehow. I have index in sampleId, studyId, attrId and composite of all three.
Both table "test" and "clinical_sample" have the same number of rows.

Rick James
  • 135,179
  • 13
  • 127
  • 222
CpnAhab
  • 159
  • 1
  • 1
  • 6
  • Add a composite index on `(studyId, attrId, sampleId)` to the `test` table. – Barmar Oct 29 '21 at 15:12
  • Thanks @Barmar it has exactly that index already. Also individual indexes on the columns. I did notice in Workbench that the query with joins seems to scan far fewer rows. About 829k vs ~2400k for the first one. – CpnAhab Oct 29 '21 at 15:26

2 Answers2

0

It would help to see what the PRIMARY KEY of each table is.

Some of these indexes are likely to help:

clinical_sample:  INDEX(ATTR_ID, INTERNAL_ID,  ATTR_VALUE)
sample:  INDEX(STABLE_ID, INTERNAL_ID,  PATIENT_ID)
patient:  INDEX(INTERNAL_ID,  STABLE_ID, CANCER_STUDY_ID)
cancer_study:  INDEX(CANCER_STUDY_IDENTIFIER, CANCER_STUDY_ID)

I agree with Barmar's INDEX(studyId, attrId, sampleId) for the materialized view.

I have index in sampleId, studyId, attrId and composite of all three.

Let's see the EXPLAIN. It may show that it is using your index just on (sampleId) when it should be using the composite index.

Also put the IN column last, not first, regardless of cardinality. More precisely, put = columns first in a composite index.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • These indexes are on the original tables, they have no effect on the performance of the materialized view. – Barmar Oct 29 '21 at 15:27
  • Right. The mystery is why the materalized view would be MUCH slower. – CpnAhab Oct 29 '21 at 15:30
  • @Barmar - How could you tell that those indexes were on the original tables?? – Rick James Oct 29 '21 at 15:34
  • I mean you're telling him what indexes should be on the original tables. But he's not working with the original tables any more, so they don't matter. – Barmar Oct 29 '21 at 15:35
  • These indexes would affect the performance of creating the materialized view in the first place, but they don't matter when querying the view. – Barmar Oct 29 '21 at 15:36
  • OK, I am providing info that was not requested. – Rick James Oct 29 '21 at 15:39
  • A little more info: it appears to be only using the index for attrId, NOT the compound index. I deleted all indexes (on the test table) except the compound index. 5 times slower. So it appears it's just not employing the compound index for some reason. – CpnAhab Oct 29 '21 at 15:43
  • @CpnAhab - Are you referring to `clinical_sample` or the View? Please provide `EXPLAIN SELECT ...` I think you were right to get rid of `INDEX(a)` when you also have `INDEX(a,b)`. The former is redundant, yet the Optimizer seems to prefer it because it is "smaller". – Rick James Oct 29 '21 at 16:09
  • Appears to not use the compound index at all. I even attempted to force it FROM test USE INDEX(total) @RickJames this is explains: '1','SIMPLE','test',NULL,'ALL','total',NULL,NULL,NULL,'14360648','0.50','Using where'. – CpnAhab Oct 29 '21 at 16:19
  • @CpnAhab - Which `SELECT` gave that `EXPLAIN`? Let's see the full definition of the materialized view. That is, please provide the `SHOW CREATE TABLE` for the table that was created from that view. (MySQL has no "materialized" views, only views that you materialize into a table.) – Rick James Oct 29 '21 at 16:31
  • @RickJames Here is a DESCRIBE on the table. Yes, I use "materialized view" conceptually. The table is very simple. In general would you agree that it is possible for a JOIN strategy to vastly outperform a query on a simple table of denormalized data? Does that make any sense? 'id','int(11)','NO','PRI',NULL,'auto_increment' 'internalId','int(11)','NO','',NULL,'' 'sampleId','varchar(50)','NO','MUL',NULL,'' 'patientId','varchar(50)','NO','',NULL,'' 'attrId','varchar(255)','NO','',NULL,'' 'studyId','varchar(255)','YES','',NULL,'' 'attrValue','varchar(255)','NO','',NULL,'' – CpnAhab Oct 29 '21 at 16:56
  • `DESCRIBE` is less descriptive than `SHOW CREATE TABLE`. For one thing, `DESCRIBE` does not clearly identify all the indexes. Though perhaps it does say that you have not added Barmar's index. – Rick James Oct 29 '21 at 17:00
0

Food for thought: When and why are database joins expensive? this leads me to believe that normalized tables with indexes could actually be faster than my denormalized attempt (materialized view).

CpnAhab
  • 159
  • 1
  • 1
  • 6