0

I have a table (MySQL 8) with approx. 100M records storing various bits of stock data (price, date, etc.). Query A below runs in < 1 s, but query B takes over 2 minutes. Among other indices, I've got an index on the date, and the primary key for the table is (symbol, date). What would cause such a significant difference between the two queries, and what might speed up the poor performer?

Query A:

SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01'
GROUP BY symbol

Query B:

SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol

The other challenge I'm facing is that at times I want to filter by market_cap, but other times by other numerical fields (gross_profit, total_assets, etc.). The query is being generated by a form with a number of optional inputs that are tied to params.

Table schema

CREATE TABLE IF NOT EXISTS cri_v0_995 (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  company_id MEDIUMINT UNSIGNED NOT NULL, 
  dt DATE NOT NULL,
  price DECIMAL(18, 2),
  market_cap DECIMAL(12, 4),
  div_yield DECIMAL(4,2), -- this is always 0
  price_to_nte DOUBLE,
  price_to_mte DOUBLE,
  nte_to_price DECIMAL(16, 10),
  ante_to_price DECIMAL(16, 10),
  ate_to_price DECIMAL(18, 10),
  price_to_sales DOUBLE,
  price_to_earnings DOUBLE,
  cur_liq_to_mcap DECIMAL(4, 2), -- this is always 0
  net_liq_to_mcap DOUBLE,
  g3_rev_growth_and_trend DECIMAL(14, 10),
  p_cri_score DECIMAL(14, 10),
  f_cri_score DECIMAL(10, 7),
  cri_score DECIMAL(14, 10),
  PRIMARY KEY (id),
  FOREIGN KEY (company_id) REFERENCES companies (id),
  UNIQUE KEY (company_id, dt)
);

Note that there are a couple of columns that I'm unsure about. They've always been zeros, but I don't know what the intent behind them may be at the moment.

(edit 1 to address missing GROUP BYs) (edit 2 adding table schema)

0x11
    Your queries are malformed. They are aggregation queries with no `group by`, but you have unaggregated columns in the `select`. – Gordon Linoff Dec 02 '20 at 02:59
  • Yes, both queries are malformed. That means they are returning random results, and you probably are not aware of that. **Do not use these queries in production**. – The Impaler Dec 02 '20 at 03:02
  • Shoot, thanks Gordon and The Impaler. The `GROUP BY` is part of the queries I'm testing, I just copied them over here poorly. – 0x11 Dec 02 '20 at 03:10
  • Obviously you wanted to index market_cap – Murat Tutumlu Dec 02 '20 at 03:28
  • Is it reasonable to have an index on `market_cap` alone? And if I've got ~20 fields I'm sometimes attempting to filter by, does adding an index for each make sense? – 0x11 Dec 02 '20 at 03:29
  • 1
    Your date range is 5 years plus 1 day long. Beware of `BETWEEN`. – Rick James Dec 08 '20 at 18:47

3 Answers

1

The first query could simply hop through the table:

  • For each symbol (which is conveniently the start of the PK)
  • Find the first row that is also >= the start date ("first" works because the second part of the index is date)
  • Toss the result if that date is not <= the end date

The second query needs to look at each row to check market_cap; it can't jump through the table.
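
A quick way to confirm this, assuming the PK really is (symbol, date), is to compare the EXPLAIN output of the two queries:

-- Query A: the Extra column should show "Using index for group-by",
-- MySQL's loose index scan over the (symbol, date) PK.
EXPLAIN
SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01'
GROUP BY symbol;

-- Query B: the extra market_cap predicate disqualifies that optimization,
-- so the plan typically has to examine a far larger share of the rows.
EXPLAIN
SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol;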

If, instead, you have current_market_cap in the Symbols table, you could filter on market_cap before JOINing to this table.
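
A sketch of that rewrite (it assumes a Symbols table keyed by symbol with a current_market_cap column; neither is shown in the question):

SELECT st.symbol, MIN(st.date)
FROM Symbols AS s
JOIN Stocks  AS st ON st.symbol = s.symbol
WHERE s.current_market_cap > 20
  AND st.date BETWEEN '2015-01-01' AND '2020-01-01'
GROUP BY st.symbol;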

Two ranges in the WHERE clause make it very difficult to optimize; INDEXes are one-dimensional.

Using PARTITION BY RANGE(TO_DAYS(date)) requires a major structural change to the table. It may (or may not!) help your query run faster -- by using 'partition pruning' to limit how many rows need to be checked. (I say "may not" because the query is looking at a 5-year range, which might be a significant fraction of the entire data.)

More discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint and http://mysql.rjweb.org/doc.php/find_nearest_in_mysql -- The latter link discusses a different 2D problem (geographical 'find nearest'); it is something of a stretch to apply it to your query.
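
A rough sketch of that structural change, using the column names from the posted schema. MySQL won't allow foreign keys on a partitioned InnoDB table, and the partitioning column must be part of every unique key, so this assumes the FK has been dropped and (company_id, dt) promoted to the primary key (as suggested below):

ALTER TABLE cri_v0_995
PARTITION BY RANGE (TO_DAYS(dt)) (
  PARTITION p2015 VALUES LESS THAN (TO_DAYS('2016-01-01')),
  PARTITION p2016 VALUES LESS THAN (TO_DAYS('2017-01-01')),
  PARTITION p2017 VALUES LESS THAN (TO_DAYS('2018-01-01')),
  PARTITION p2018 VALUES LESS THAN (TO_DAYS('2019-01-01')),
  PARTITION p2019 VALUES LESS THAN (TO_DAYS('2020-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);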

Since you have lots of columns that the end-user might filter on, and 100M rows, let's approach from another direction: Minimizing table size. This is especially important if the table cannot be fully cached in the buffer_pool -- leading to being I/O-bound. Show us SHOW CREATE TABLE; let's discuss each column, and whether it can be shrunk.
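
To get a rough idea of whether the table fits in the buffer pool, something like this helps (standard information_schema columns and a server variable; nothing specific to this schema):

SELECT ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'cri_v0_995';

SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;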

More

  • Changing symbol VARCHAR... to company_id MEDIUMINT UNSIGNED may have saved 1GB between the data and the index.
  • Get rid of id and promote UNIQUE(company_id, dt) to be the PK. That will save a few GB by eliminating the only secondary index. (Your change was probably beneficial.)
  • Most of those DOUBLEs are probably overkill; FLOAT would save 4 bytes each and still give you 6-7 significant digits.
  • You may want INDEX(dt) for some other queries.
  • The filter on market_cap probably gets in the way of groupwise max optimization.
  • Depending on disk space and other queries, it may be beneficial to PARTITION BY RANGE(TO_DAYS(dt)), using year-sized partitions. The (5 year + 1 day) span would hit 6 partitions (cf. "partition pruning"). This would not actually change performance much.
  • (About 18 years ago, I worked with a dataset like this.)
  • price DECIMAL(18, 2) takes 9 bytes. It allows for a zillion dollars, which has not [yet] been reached. It has only 2 decimal places, so it won't precisely hold prices from before the switch to decimal pricing, when quotes were in fractions (/2, /4, /8, /16, etc.).
  • market_cap DECIMAL(12, 4) (6 bytes) may not be big enough for some companies, and certainly not for market indexes. And the 4 decimal places are probably a waste.
  • Suggest running SELECT MAX(market_cap), MAX(price), ... to see how big the numbers are now.
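
For that last check, something along these lines works (the column list is just an example; extend it to whatever you plan to filter on):

SELECT MAX(market_cap)        AS max_market_cap,
       MAX(price)             AS max_price,
       MAX(price_to_sales)    AS max_price_to_sales,
       MAX(price_to_earnings) AS max_price_to_earnings
FROM cri_v0_995;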
Rick James
  • Thanks for your thoughts. Actually happened upon that second link this afternoon looking for answers to this, but having trouble applying it directly. Added a schema definition; have made a couple of changes to use a separate companies table and a FK rather than having symbol be the key -- not sure whether it was a move in the right direction or not. Anyway, hopeful that you have some insight! – 0x11 Dec 10 '20 at 01:06
  • @0x11 - I added a bunch more. – Rick James Dec 10 '20 at 02:07
  • Regarding doubles being overkill, some of the values have 10 significant digits. Would they fit in a float? Though now I'm also wondering how significant those last few places might be in the first place. Think I'm with you. – 0x11 Dec 10 '20 at 02:35
  • Could you elaborate on market cap getting in the way of groupwise optimization? – 0x11 Dec 10 '20 at 02:36
  • I want to test partitioning. Are you saying the partition needs a GROUP BY? I'm still trying to wrap my head around partitions. – 0x11 Dec 10 '20 at 02:38
  • `FLOAT` can handle only about 7 significant digits. But to argue the point further: would you display miles-per-hour to more than 2 significant digits? Or temperature to more than 3? Market cap can be computed to lots of digits, but most of it is noise; "123 million" is good enough, after all, tomorrow it may be 129 million. – Rick James Dec 10 '20 at 05:38
  • Tricks can be played for MAX(..) GROUP BY, but not when a "range" is included in the `WHERE`. I don't think I can explain it. – Rick James Dec 10 '20 at 05:40
  • `PARTITION` splits a table into "sub tables". `GROUP BY` is not related. A `SELECT` gathers rows from one or more partitions perhaps with the aid of an index, combines the results, then can apply grouping, ordering, and limiting. – Rick James Dec 10 '20 at 05:42
  • When you say 'hop through the table', are you suggesting using a loop and temp table, or a cursor, or multiple queries and stitching together in the application? – 0x11 Dec 10 '20 at 13:56
  • @0x11 - None of the above -- The Optimizer knows how to do this one case [at least]: `SELECT x, MIN(y) FROM..GROUP BY x` when it has `INDEX(x,y)` (or `MAX`)-- Namely, drill down the BTree for the first (or last) entry for each `x`. That's what I mean by "hop". (There is a fancy name for it, but I forget.) `AND market_cap>20` destroys the pattern, so the optimization cannot be applied. – Rick James Dec 10 '20 at 16:42
0

For both queries, you should probably be aggregating by symbol. So the second, currently non-performant query should be:

SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol;

The index you want here should at least cover the entire WHERE clause:

CREATE INDEX idx_date_market_cap ON Stocks (date, market_cap);

If you run EXPLAIN on both queries, after adding GROUP BY, you might find that your current single-column index on date isn't even being used in the second query.
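
For instance (the index name is hypothetical, matching the composite index suggested above):

EXPLAIN
SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol;

-- With the composite index in place, the "key" column should list
-- idx_date_market_cap rather than NULL or only the primary key.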

Tim Biegeleisen
  • Sorry about that. The queries in question do in fact have the proper `GROUP BY symbol` clause, just missed it in copying over. – 0x11 Dec 02 '20 at 03:11
  • That said, part of the complication lies in the fact that I'm sometimes using `market_cap` and other times any number of other parameters (`price`, `gross_profit`, etc.) which makes adding the appropriate indices challenging. – 0x11 Dec 02 '20 at 03:13
  • Sure. But that's another question and not what you actually asked. Your question above has been answered. – Tim Biegeleisen Dec 02 '20 at 03:14
  • Gotcha..any advice? – 0x11 Dec 02 '20 at 03:17
  • You need to write out the queries and then see which columns are the most restrictive. It could be that one index can cover all queries, or maybe you need more than one index. – Tim Biegeleisen Dec 02 '20 at 03:24
  • Thanks, think I get what you're saying. To clarify, are you saying it might be the case that `CREATE INDEX ON Stocks(date, market_cap, price, gross_profit)` will benefit all queries enough to solve my problem, but that its also possible that I'll need a few more specific indexes to handle differences? I guess that's where I'm a bit confused -- could you clarify what you mean by "most restrictive"? – 0x11 Dec 02 '20 at 03:34
  • 1
    @0x11 Typically when choosing a multi-column index, we put the most restrictive columns first (i.e. highest cardinality). – Tim Biegeleisen Dec 02 '20 at 04:06
  • @0x11 - I disagree with Tim: https://stackoverflow.com/questions/50239658/higher-cardinality-column-first-in-an-index-when-involving-a-range – Rick James Dec 10 '20 at 02:06
0

Maybe you need to force the use of an index in query B:

SELECT symbol, MIN(date)
FROM Stocks USE INDEX (`indexNameOfDate`)
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol

Or you can force the use of the primary key index.
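
For example (assuming the primary key described in the question, (symbol, date)):

SELECT symbol, MIN(date)
FROM Stocks FORCE INDEX (PRIMARY)
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol;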

Doing that may save the SQL engine the time it spends choosing an index itself, and you can find out which is faster as well.

What's more, if you usually filter by date and market_cap, you may want to create an index covering both of them, like @Tim Biegeleisen said:

CREATE INDEX idx_date_market_cap ON Stocks (date, market_cap);

tomy0608
  • I tried `FORCE INDEX (PRIMARY)`, since the pk is `(symbol, date)`, but that didn't seem to have an effect. Mentioned in response to @Tim that I'm unsure whether creating an index is the right answer for me given that I've got a number of different params being used to filter, and would theoretically need an index for each of them to address the issue. Any thoughts on whether or not adding that many indices is a reasonable thing to do? – 0x11 Dec 02 '20 at 03:28
  • Have you tried just FORCE INDEX (date_index), not PRIMARY? It doesn't make sense to create an index for every field alone if you have many fields to filter by that are used infrequently. – tomy0608 Dec 02 '20 at 03:39
  • No, I don't have one specifically on date, because query A ran so fast only using the primary. Do you think adding a date-specific index might help? – 0x11 Dec 02 '20 at 03:41
  • Forcing MySQL to use an index is generally not a good idea. – Tim Biegeleisen Dec 02 '20 at 04:06