1

We have a table MySql , following is the schema

CREATE TABLE campaigns (
  domain varchar(50) ,
  campaign_id bigint(12) ,
  log_time datetime ,
  log_type int,
  node_id bigint(12) 
)

Brief information about table

One domain can have multiple campaigns and one campaign can have multiple nodes

Table having 150 million rows. Unique domains are 40k.

I want to create a composite index on this table to get reports on both campaign level and node level

Suppose if I create composite index like following

KEY campid_domain_nodeid_logtime (`campaign_id`,`domain`,`node_id`,`log_time`)

Does is it fully satisfy following queries that means both on campaign level and node level

Campaign level reports

select count(*) from campaigns 
where domain = 'aaa' and campaign_id = '1235' 
  and log_time between '2016-01-01 00:00:00' and '2016-02-02 00:00:00' 

Node level reports

select count(*) from campaigns
       where domain = 'aaa' and campaign_id = '1235'
         and node_id = '2345'
         and  log_time between '2016-01-01 00:00:00' and '2016-02-02 00:00:00


  

Thanks

Rick James
  • 135,179
  • 13
  • 127
  • 222
Rams
  • 2,141
  • 5
  • 33
  • 59
  • You can go through this link: http://stackoverflow.com/questions/1823685/when-should-i-use-a-composite-index – Adarsh Gangadharan Apr 22 '16 at 13:35
  • 1
    Yes that index will be used by both queries,assuming data distributions allows it.Order of the index columns doenst matter as long as you have AND conditions and equality.Corrections(assuming the WHERE condtions ar as many as the index columns,so the first one wont be able to fully utilise the index – Mihai Apr 22 '16 at 13:51
  • I've fiddle It:http://sqlfiddle.com/#!9/2bade9/1 And it seems to use the index for both. I admin it's a little unexpected. – sab Apr 23 '16 at 09:30

4 Answers4

16

You can think of an index just as a order list with fast lookup. If you have a compound index with fields A,B,C,D, the list will be ordered on A, then for rows with the same A for B, than C, than D.

A1 | B1 | C1 | D1 | -> pointer to row
A1 | B1 | C1 | D2 | -> pointer to row
A1 | B1 | C2 | D1 | -> pointer to row
A1 | B1 | C2 | D2 | -> pointer to row
A1 | B2 | C1 | D1 | -> pointer to row
...
A2 | B1 | C1 | D1 | -> pointer to row
A2 | B1 | C1 | D2 | -> pointer to row

A query optimizer will check your query. If your query asks for A,B,C,D, everything is fine. The order of the query should not matter for a good database, so you can also write the query where D and C and B and A.

If your query ask only for A, again everything is fine, because all rows with the same A are one after the other.

If your query asks only for D, the index is of no use. rows with the same D, but different A are distributed over the whole list.

If your query asks for A, B, D, like your campaign-level report, then the index is of some use. It can be used to speed up the lookup of A and B, but it then needs to iterate over all data because C is missing.

You can define more than one index. The drawback is of course that each additional index makes writing slower and needs some space on the harddrive.

Meier
  • 3,858
  • 1
  • 17
  • 46
  • What about geospatial data? Are multi-column indexes efficient for them? e.g. for a query like: `SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079 AND longitude > -0.1162 AND longitude < -0.1004;` and an index like `ALTER TABLE restaurants ADD INDEX \`lat_lng\` (lat, lng);`, will the query optimizer perform well. Or, if not, do I need another type of index? My assumption is that as lat/lng coordinates are real numbers, the chance that two or more restaurants having the same latitude and/or longitude is small (only if they are really close). What do you think? Thank you! – tonix Sep 18 '21 at 15:11
  • 1
    @tonix - The optimizer essentially stops when it gets to a range test -- `lat` in this example. I go into more details, plus workarounds in [_Find Nearest_](http://mysql.rjweb.org/doc.php/find_nearest_in_mysql) – Rick James Jul 26 '23 at 07:21
  • But if C is tested with a range, it won't use D. So, your answer is incomplete. – Rick James Jul 26 '23 at 07:23
0

https://dev.mysql.com/doc/refman/5.6/en/multiple-column-indexes.html

The above link says much about the order of indexing during multi column indexes

create index for column in the below order

domain,campaign_id,nodeid,log time

and change the Node Level Report to

select count(*) from campaigns
where domain = 'aaa' and campaign_id = '1235'  
and  log_time between '2016-01-01 00:00:00' and '2016-02-02 00:00:00'
and node_id = '2345'
Andrews B Anthony
  • 1,381
  • 9
  • 27
  • 1
    The order in the query-statement should not matter, as sql is declarative and a query optimizer should reorder it as needed. The problem is the other query, where the node field is missing, but it is in the middle of the index. – Meier Apr 23 '16 at 08:20
0

I recommend 2 indexes:

Campaign:

INDEX 'cdl' (campaign_id, domain_id,  -- either order; first because tested '='
      log_time)                -- last because range test

Node:

INDEX 'cdnl' (campaign_id, domain_id, node_id,  -- any order; tested '='
      log_time)                -- last because range

If 'cdnl' is used for Campaign, it will have to step over the node_ids that are node wanted. (This may, or may not, be a significant burden)

The order of AND clauses in WHERE does not matter. The Optimizer will shuffle as needed while looking for the best INDEX.

But... From the table name, I would guess that campaign_id is the PRIMARY KEY. If it is, starting any secondary index with campaign_id is useless. On the other hand, if some combination of these 4 columns, say 'cdnl', is Unique, then make it the PK and do not add the equivalent secondary index. (You still need the other index.)

More on building an optimal index for a query: Index Cookbook

Nitpick about the range:

log_time between '2016-01-01 00:00:00' and '2016-02-02 00:00:00' 

can be written

log_time between '2016-01-01' and '2016-02-02' 

but both are probably wrong since there is an extra midnight. So I suggest:

    log_time >= '2016-01-01'
AND log_time  < '2016-02-03'   -- (next day)
Rick James
  • 135,179
  • 13
  • 127
  • 222
-1

No, the composite index is not going help any of the 2 queries you listed in this format. Fields in the where criteria need to be in the same order as they are in the index.

I would also change the order of the fields in the index by moving the log_time to the 3rd position:

KEY campid_domain_nodeid_logtime (`campaign_id`,`domain`,`log_time`,`node_id`)

1st query change the order of campaign_id and domain:

select count(*) from campaigns 
where campaign_id = '1235' and domain = 'aaa'
and log_time between '2016-01-01 00:00:00' and '2016-02-02 00:00:00' 

2nd query change the order of campaign_id and domain + node_id and log_time:

select count(*) from campaigns
where  campaign_id = '1235' and domain = 'aaa'
   and  log_time between '2016-01-01 00:00:00' and '2016-02-02 00:00:00' 
   and node_id = '2345'

You can run explain to verify the use of the index. If you have any queries related to the nodes, which do not filter on log_time, then those queries can only usecampaign_id and domain parts of the index.

Shadow
  • 33,525
  • 10
  • 51
  • 64
  • The order of the query sub-clauses in the where statement should not matter. You should say either "A and B" or "B and A", and the query analyser of the database will just see that fields A and B are needed, and will check what indix is available. – Meier Apr 23 '16 at 08:29
  • @shadow - WRONG. The order of items in WHERE does not matter. What matters is that all columns tested with `=` be first in the index. That is, the index will work fine for the Note level report. The campaign level report needs a different index because an extra column (node_id) is in the way. (Meier addresses only part of the issues.) – Rick James Jul 26 '23 at 07:13