
I have a table with the following columns:

CREATE TABLE some_table (
    col_1 ...
    col_2 ...
    col_3 ...
    col_4 ...
    col_5 ...

    -- some more columns
) ENGINE = InnoDB DEFAULT CHARSET = utf8 DEFAULT COLLATE utf8_unicode_ci;

and I perform the following search:

SELECT * FROM some_table t
WHERE (t.col_1 LIKE :col1
    OR t.col_2 = :col2
    OR t.col_3 = :col3
    OR t.col_4 = :col4)
AND t.col_5 > :col5;

I populate :col1, :col2, etc. within the code. So ideally my index would be:

ALTER TABLE some_table
ADD INDEX my_index (col_1, col_2, col_3, col_4, col_5);

However, a new requirement showed up: when :col1, :col2, etc. are NULL, I need to skip them. There are two alternatives:

  1. Make the SQL query dynamic with multiple indexes:

I can query like:

SELECT * FROM some_table t
WHERE (t.col_1 LIKE :col1
    -- OR t.col_2 LIKE :col2 -- Remove these columns from query if they're null
    -- OR t.col_3 LIKE :col3 -- Remove these columns from query if they're null
    OR t.col_4 LIKE :col4)
AND t.col_5 LIKE :col5;

but then I would need a lot more indexes, because an index on (col_1, col_2, col_3, col_4, col_5) would cover, say, a col_1, col_2, col_3 query but not a col_1, col_3, col_4 query, as per here and here. So my indexes become:

ALTER TABLE some_table
ADD INDEX index_1 (col_1, col_2, col_3, col_4, col_5),

ADD INDEX index_2 (col_1, col_2, col_3, col_5),
ADD INDEX index_3 (col_1, col_2, col_4, col_5),
ADD INDEX index_4 (col_1, col_3, col_4, col_5),
ADD INDEX index_5 (col_2, col_3, col_4, col_5),

ADD INDEX index_6 (col_1, col_2, col_5),
ADD INDEX index_7 (col_1, col_3, col_5),
ADD INDEX index_8 (col_1, col_4, col_5),
ADD INDEX index_9 (col_2, col_3, col_5),
ADD INDEX index_10 (col_2, col_4, col_5),
ADD INDEX index_11 (col_3, col_4, col_5),

ADD INDEX index_12 (col_1, col_5),
ADD INDEX index_13 (col_2, col_5),
ADD INDEX index_14 (col_3, col_5),
ADD INDEX index_15 (col_4, col_5),
ADD INDEX index_16 (col_5);

I do not really care about my read performance, but a lot of storage overhead might be an issue in the long term.

  2. Keep the single index (my_index) as is, but change the query:

SELECT * FROM some_table t
WHERE ((t.col_1 IS NOT NULL AND t.col_1 LIKE :col1)
    OR (t.col_2 IS NOT NULL AND t.col_2 LIKE :col2)
    OR (t.col_3 IS NOT NULL AND t.col_3 LIKE :col3)
    OR (t.col_4 IS NOT NULL AND t.col_4 LIKE :col4))
AND t.col_5 LIKE :col5;

I don't know if this would be enough to trick MySQL into using the index, because if, say, :col2 is NULL, the condition becomes t.col_2 IS NOT NULL AND t.col_2 LIKE NULL, and since this is an impossible condition, the optimizer could remove it.
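
For reference, a comparison against a NULL parameter is never true in MySQL (it evaluates to NULL, which WHERE treats as "not true"), which is why the optimizer is free to discard such a branch. A quick check in any MySQL client:

-- Both expressions evaluate to NULL, so the branch can never match:
SELECT 'abc' LIKE NULL;
SELECT ('abc' IS NOT NULL AND 'abc' LIKE NULL);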

With the data currently in hand, both queries take about the same time. What is the correct way to proceed?

Hasan Can Saral
  • @Akina Hi, thanks for the comment. Can you elaborate on *normalize*? The problem is that the data is not mine, so I have very few options (like creating indexes) for the moment. – Hasan Can Saral Jan 15 '21 at 12:51
  • @Akina, sorry, I was not specific enough. The columns are not dependent. To be exact with the context: in an online payment transaction, the columns are `merchantId`, `customerId`, `transactionDate`... – Hasan Can Saral Jan 15 '21 at 12:58
  • `col5` is the only part that will be used; putting it last makes the index useless. See my answer for more discussion. – Rick James Jan 15 '21 at 18:52

2 Answers


OR won't use an index (except in the case of "covering"). So, don't bother with those multi-column indexes.

A trick to avoid OR is to use UNION. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
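
A minimal sketch of the rewrite, assuming the exact-match semantics on col_1 through col_4 and the range test on col_5 described in the question and comments (only include the SELECTs whose parameter is non-NULL):

-- Each SELECT can use its own two-column index, e.g. INDEX(col_1, col_5),
-- INDEX(col_2, col_5), and so on. UNION (rather than UNION ALL) removes the
-- duplicates that appear when a row matches more than one branch.
SELECT * FROM some_table WHERE col_1 = :col1 AND col_5 > :col5
UNION
SELECT * FROM some_table WHERE col_2 = :col2 AND col_5 > :col5
UNION
SELECT * FROM some_table WHERE col_3 = :col3 AND col_5 > :col5
UNION
SELECT * FROM some_table WHERE col_4 = :col4 AND col_5 > :col5;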

You are using LIKE. Does this mean that the user might say LIKE '%abc'? If so, that also cannot use an index, even with the UNION technique. That is, if you discover any fill-ins that start with a wildcard, then your original OR is probably optimal (though slow).

Are the search values "words"? In particular, the same word? If so, there is an excellent (though not perfect) solution:

FULLTEXT(col1, col2, col3, col4)

together with

MATCH(col1, col2, col3, col4) AGAINST("+word" IN BOOLEAN MODE)

That will be faster than any other approach, but it depends on the search strings being "words" and of a minimal length and not being "a stop word".
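
A sketch of what that could look like, assuming the searched values qualify as words and keeping the col_5 filter from the question (the index name ft_cols is illustrative):

-- InnoDB supports FULLTEXT indexes since MySQL 5.6.
ALTER TABLE some_table
    ADD FULLTEXT INDEX ft_cols (col_1, col_2, col_3, col_4);

-- Boolean-mode search across the four columns, then filter on col_5.
SELECT *
FROM some_table
WHERE MATCH(col_1, col_2, col_3, col_4) AGAINST('+word' IN BOOLEAN MODE)
  AND col_5 > :col5;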

Whatever approach you settle on, I recommend you "construct" the query in your app so that you can skip col2 if the user does not need to test with it, etc.

As for AND col5 LIKE ... -- Again this is costly if it starts with a wildcard. If it does not, then this will be beneficial:

INDEX(col5)

If no wildcard, then that is equivalent to AND col5 = '...', which optimizes nicely. If it ends with a wildcard, it is a "range", somewhat like BETWEEN. This is also efficient.
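
To illustrate with an index on col_5 alone (the literal values are hypothetical):

SELECT * FROM some_table WHERE col_5 LIKE 'abc';   -- no wildcard: behaves like col_5 = 'abc'
SELECT * FROM some_table WHERE col_5 LIKE 'abc%';  -- trailing wildcard: an index range scan
SELECT * FROM some_table WHERE col_5 LIKE '%abc';  -- leading wildcard: the index cannot be used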

If you have both:

WHERE MATCH(col1, col2, col3, col4) AGAINST("+word" IN BOOLEAN MODE)
  AND col5 ...

The Optimizer will perform the MATCH first; any index on col5 or leading wildcard won't matter.

Since the columns are merchantId, customerId, transactionDate, etc., it further behooves you to "construct" the query. If it looks like a date, then test only transactionDate. That gets rid of OR and leads to possibly having INDEX(transactionDate, col5). (We can give you better advice if you are more specific with the column names and types.) Also, are those Ids numbers or strings? What are some examples of where your LIKE might involve a wildcard? Or do the users even know about such?

Rick James
  • I'm always looking for an exact match (apart from `transactionDate`, which is `col_5`, and the condition is `>`). So I interpret this as: I should convert to `UNION`s, with `INDEX (col_1, transaction_date)`, `INDEX (col_2, transaction_date)`, `INDEX (col_3, transaction_date)` and `INDEX (col_4, transaction_date)`. Would another `INDEX (transaction_date)` help? – Hasan Can Saral Jan 15 '21 at 22:17
  • Not performing the `UNION` for the columns where `:colN` is `null` will also take care of handling of the `null` values. – Hasan Can Saral Jan 15 '21 at 22:22
  • @HasanCanSaral - Yes. That is part of "constructing" the query -- leave out stuff that is not needed based on user input. And, yes; build a suitable index for each of the `SELECTs` in the `UNION`; and they will usually be different. – Rick James Jan 16 '21 at 08:08

Just a suggestion. Apart from the problems related to a probably bad attribute design, a lot of your indexes are totally useless:

ADD INDEX index_1 (col_1, col_2, col_3, col_4, col_5),

ADD INDEX index_2 (col_1, col_2, col_3, col_5),
ADD INDEX index_3 (col_1, col_2, col_4, col_5),
ADD INDEX index_4 (col_1, col_3, col_4, col_5),
ADD INDEX index_5 (col_2, col_3, col_4, col_5),

ADD INDEX index_6 (col_1, col_2, col_5),
ADD INDEX index_7 (col_1, col_3, col_5),
ADD INDEX index_8 (col_1, col_4, col_5),

All these indexes work in the same way.

The DB engine uses an index by evaluating its columns from left to right, so you are just replicating the same leading columns several times without any gain in performance.

If you want better performance, you should change the left-to-right column order, e.g.:

ADD INDEX index_2 (col_2, col_3, col_4, col_1),
ADD INDEX index_3 (col_3, col_4, col_5, col_1),

Anyway, you should index the most frequently used and most selective columns first.
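
A small illustration of that left-to-right (leftmost-prefix) behaviour, with hypothetical values:

ALTER TABLE some_table ADD INDEX index_2 (col_2, col_3, col_4, col_1);

-- Can use index_2: the filters start at its leftmost column.
SELECT * FROM some_table WHERE col_2 = 'x';
SELECT * FROM some_table WHERE col_2 = 'x' AND col_3 = 'y';

-- Cannot seek index_2: col_2, its leftmost column, is not constrained.
SELECT * FROM some_table WHERE col_3 = 'y' AND col_4 = 'z';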

ScaisEdge
  • Thank you for the answer. So indexes with 5 columns for every possible sequence would cover all my queries? – Hasan Can Saral Jan 15 '21 at 12:19
  • Practically yes, even though the concept is more complex. Anyway, I suggest a review of the table structure and normalizing these columns into a proper relation. – ScaisEdge Jan 15 '21 at 12:22
  • Can you elaborate on the meaning of **normalize**? Thanks. – Hasan Can Saral Jan 15 '21 at 12:26
  • It depends on the meaning of the columns. If the columns refer to completely different data, then normalization is not necessary. If the columns are repetitions of the same kind of content, then you should create a detail table for the values related to your main record. But without a real data sample it is not possible to answer properly. – ScaisEdge Jan 15 '21 at 12:55
  • They are absolutely independent. To be exact with the context, in an online payment transaction, columns are, `merchantId`, `customerId`, `transactionDate`... – Hasan Can Saral Jan 15 '21 at 12:58
  • If so, you don't need normalization, just proper indexing. – ScaisEdge Jan 15 '21 at 13:00
  • Yes, the question is, what is the proper indexing, without creating `5!` indexes? – Hasan Can Saral Jan 15 '21 at 13:02
  • In DB tuning, the columns most used in queries are indexed first, so you need to evaluate those columns; after some testing, further indexes are added based on the remaining most-used columns. – ScaisEdge Jan 15 '21 at 13:12
  • Start with the `SELECTs, UPDATEs, DELETEs` that are important; _they_ should drive the choice of indexes. – Rick James Jan 17 '21 at 00:43