2

I am working on a data analytics Dashboard for a media content broadcasting company. Even if a user clicks a certain channel, logs/records are stored into MySQL DB. Following is the table that stores data regarding channel play times.

Here is the table structure:

 _____________________________________
|           ID INT(11)                |
 _____________________________________
|        Channel_ID INT(11)           |
 _____________________________________
|       playing_date (DATE)           |
 _____________________________________  
|      country_code VARCHAR(50)       | 
 _____________________________________
|      playtime_in_sec INT(11)        | 
 _____________________________________
| count_more_then_30_min_play INT(11) | 
 _____________________________________
|    count_15_30_min_play INT(11)     | 
 _____________________________________
|       count_0_15_min_play           | 
 _____________________________________
|   channel_report_tag VARCHAR(50)    |
 _____________________________________ 
|   device_report_tag VARCHAR(50)     |
 _____________________________________ 
|   genre_report_tag VARCHAR(50)      |
 _____________________________________

The Query that I run behind one of the dashboard graphs construction is :

    SELECT 
        channel_report_tag,
        SUM(count_more_then_30_min_play) AS '>30 minutes', 
        SUM(count_15_30_min_play) AS '15-30 Minutes', 
        SUM(count_0_15_min_play) AS '0-15 Minutes'
    FROM 
        channel_play_times_cleaned 
    WHERE 
        playing_date BETWEEN '' AND ''
        AND country_code LIKE ''        
        AND device_report_tag LIKE '' 
        AND channel_report_tag LIKE  ''
    GROUP BY 
        channel_report_tag
    LIMIT 10

This query basically is taking a lot of time to return the result set (given the table data exceeds a million records per day and increasing every second ). I came across this stack-overflow Question : What generic techniques can be applied to optimize SQL queries? which basically mentions employing indices as one the techniques to optimize SQL queries. At the moment I am confused how to apply indices (i.e on what columns) in order to optimize the above mentioned query. I would be very grateful if some one could offer help in creating indices according to my specific scenario. Any other expert opinion for a beginner like me are surely welcomed.

EDIT :

As suggested by @Thomas G ,

I have tried to improve my query and make it more specific :

SELECT 
        channel_report_tag,
        SUM(count_more_then_30_min_play) AS '>30 minutes', 
        SUM(count_15_30_min_play) AS '15-30 Minutes', 
        SUM(count_0_15_min_play) AS '0-15 Minutes'
    FROM 
        channel_play_times_cleaned 
    WHERE 
        playing_date BETWEEN '' AND ''
        AND country_code = 'US'        
        AND device_report_tag = 'j8' 
        AND channel_report_tag = 'NAT GEO'
    GROUP BY 
        channel_report_tag
    LIMIT 10
Community
  • 1
  • 1
Danish Bin Sofwan
  • 476
  • 1
  • 6
  • 21
  • you have to do a indexing of the column ,For that first Alter a Table structure , Just Before 'Apply , Revert' button , there is a tab section , where you have to select a 'Indexes' tab , In this tab select the column which you want to index and then apply it – chirag satapara Sep 16 '16 at 10:46
  • @chiragpatel "how to apply an index" that I know. What I am asking is on which columns to apply w.r.t the query I have mentioned. Thanks any ways for the concern. – Danish Bin Sofwan Sep 16 '16 at 10:55
  • Indexing is apply on that columns which is in where condition and if that column datatype is Integer then it perform fast. – chirag satapara Sep 16 '16 at 11:08
  • Can you add example values for your `like`-code? If you use `like '%xxx%'` (so the `%` infront), indexes won't speed this up. So you are probably left with an index on `playing_date`, maybe `playing_date, channel_report_tag`. – Solarflare Sep 16 '16 at 11:09
  • @DanishBinSofwan https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html , Please refer this link , it will be very helpful to you – chirag satapara Sep 16 '16 at 11:12
  • 1
    Why 'LIKE'?!?!?!? How can a country code be LIKE something ?!?! – Strawberry Sep 16 '16 at 11:48
  • @Solarflare AND country_code LIKE 'US' AND device_report_tag LIKE 'j8' AND channel_report_tag LIKE 'NAT GEO' – Danish Bin Sofwan Sep 16 '16 at 12:24

2 Answers2

1

I started to write this in a comment because these are hints and not a clear answer. But that's way too long

First of all, it is common sense (but not always a rule of thumb) to index the columns appearing in a WHERE clause :

   playing_date BETWEEN '' AND ''
    AND country_code LIKE ''        
    AND device_report_tag LIKE '' 
    AND channel_report_tag LIKE  ''

If your columns have a very high cardinality (your tag columns???), it's probably not a good idea to index them. Country_code and playing_date should be indexed.

The issue here is that there are so many LIKE in your query. This operator is perf a killer and you are using it on 3 columns. That's awfull for the database. So the question is: Is that really needed?

For instance I see no obvious reason to make a LIKE on a country code. Will you really query like this :

AND country_code LIKE 'U%'

To retrieve UK and US ?? You probably won't. Chances are high that you will know the countries for which you are searching for, so you should do this instead :

AND country_code IN ('UK','US')

Which will be a lot faster if the country column is indexed

Next, If you really want to make LIKE on your 2 tag columns, instead of doing a LIKE you can try this

AND MATCH(device_report_tag) AGAINST ('anything*' IN BOOLEAN MODE)

It is also possible to index your tag columns as FULLTEXT, especially if you search with LIKE ='anything%'. I you search with LIKE='%anything%', the index won't probably help much.

I could also state that with millions rows a day, you might have to PARTITION your tables (on the date for instance). And following your data, a composite index on the date and something else might help.

Really, there's no simple and straight answer to your complex question, especially with what you shown (not a lot).

Thomas G
  • 9,886
  • 7
  • 28
  • 41
  • These are hints but very helpful , thanks. And basically reason for using LIKE is because the country_code, device_report_tag and channel_report_tag are controlled by dashboard parameters which the user inputs. But your point regarding country code is genuine and I can use IN over there instead of LIKE. – Danish Bin Sofwan Sep 16 '16 at 12:59
1

Separate indexes are not as useful as composite indexes. Unfortunately, you have many possible combinations, and you are (apparently) allowing wildcards, which may destroy the utility of indexes.

Suggest you use client code to build the WHERE clause rather than populating it with ''

In composite indexes, put one range last. date BETWEEN ... AND ... is a "range".

LIKE 'abc' -- same as = 'abc', so why not change to that.
LIKE 'abc%' -- is a "range"
LIKE '%abc' -- can't use an index.
IN ('CA', 'TX')  -- sometimes optimizes like '=', sometimes like 'range'.

So... Watch what queries the users ask for, then build composite indexes to satisfy them. Some rules:

  • At most one range, and put it last.
  • Put '=' column(s) first.
  • INDEX(a,b) is handled by INDEX(a,b,c), so include only the latter.
  • Don't have more than, say, a dozen indexes.

Index Cookbook

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • What if I use this query, replace LIKE with = operator. Kindly have a look at my question edit. – Danish Bin Sofwan Sep 17 '16 at 07:40
  • For clarity, use `=` when you know that is what you want. Without wildcards `=` and `LIKE` perform very similarly. If the end-user _can_ provide wildcards then is reasonably to simply use `LIKE` and let the optimizer realize that it can improve on it. – Rick James Sep 17 '16 at 21:38