I am working with MySQL 5.6. I created a table with 366 partitions to store data day by day: a year has at most 366 days, so the table has 366 partitions. The hash partitions are keyed on an integer column that stores a value from 1 to 366 for each record.

Report_Summary Table:

CREATE TABLE `Report_Summary` (
  `PartitionsID` int(4) unsigned NOT NULL,
  `ReportTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `Amount` int(10) NOT NULL,
  UNIQUE KEY `UNIQUE` (`PartitionsID`,`ReportTime`),
  KEY `PartitionsID` (`PartitionsID`),
  KEY `ReportTime` (`ReportTime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED
/*!50100 PARTITION BY HASH (PartitionsID)
PARTITIONS 366 */

My current query:

SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
FROM Report_Summary RS
WHERE RS.ReportTime >= '2014-12-26 00:00:00' AND RS.ReportTime <= '2014-12-30 23:59:59' AND 
      RS.PartitionsID BETWEEN DAYOFYEAR('2014-12-26 00:00:00') AND DAYOFYEAR('2014-12-30 23:59:59')
GROUP BY ReportDate; 

The above query works perfectly and uses partitions p360 to p364 to fetch the data. The problem arises when I pass fromDate = '2014-12-26' and toDate = '2015-01-01': the query no longer works, because the day of year for '2015-01-01' is 1, so the BETWEEN condition can never be true.
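
To illustrate the wrap-around, here is a minimal sketch (the column aliases are only illustrative) that evaluates the two bounds and tests one value against them. Because the lower bound is larger than the upper bound, no PartitionsID can satisfy the BETWEEN:

```sql
-- DAYOFYEAR restarts at 1 on January 1st, so the range is inverted.
SELECT DAYOFYEAR('2014-12-26 00:00:00') AS from_doy,   -- 360
       DAYOFYEAR('2015-01-01 23:59:59') AS to_doy,     -- 1
       360 BETWEEN DAYOFYEAR('2014-12-26 00:00:00')
               AND DAYOFYEAR('2015-01-01 23:59:59') AS in_range;  -- 0 (false)
```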

I then tried passing the values with the IN operator, and that works perfectly in the database. See the query below:

SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
FROM Report_Summary RS
WHERE RS.ReportTime >= '2014-12-26 00:00:00' AND RS.ReportTime <= '2015-01-01 23:59:59' AND 
      RS.PartitionsID IN (360,361,362,363,364,365,1)
GROUP BY ReportDate; 

To generate such a list, I created a function that takes two dates and returns a comma-separated string of the IDs:

SELECT GenerateRange('2014-12-26 00:00:00', '2015-01-01 23:59:59');

Which returns:

'360,361,362,363,364,365,366,1'

I then tried to use that function in my query, changing it as follows:

SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
FROM Report_Summary RS
WHERE RS.ReportTime >= '2014-12-26 00:00:00' AND RS.ReportTime <= '2015-01-01 23:59:59' AND 
      FIND_IN_SET(RS.PartitionsID, GenerateRange('2014-12-26 00:00:00', '2015-01-01 00:00:00'))
GROUP BY ReportDate; 

I then checked the execution plan of the above query using EXPLAIN PARTITIONS SELECT ... and found that my condition does not work: the query uses all partitions to fetch the data. I want it to use only the partitions for those dates. It should check only partitions 360,361,362,363,364,365,366,1, that is, p360 to p366 and p1.

Why is my query not working? And if this is not the right way to implement it, how can I achieve this?

I know I could implement this in application code, but I have to do it in a query.

Thanks...

Saharsh Shah
  • What are you trying to do with your query? What kind of result are you expecting? – akmozo Jan 04 '15 at 17:41
  • @akmozo I want a working query that uses only the partitions the condition requires. My current query with that condition uses all partitions, which is wrong. – Saharsh Shah Jan 05 '15 at 05:53
  • Why are you partitioning by DAYOFYEAR? Are you going to be having many queries that are comparing a single day in the year to a single day in prior year(s)? If not then you are probably better off partitioning by Year or YearMonth or something sequential if your queries will mostly be sequential data access. – BateTech Jan 07 '15 at 12:20
  • @BateTech The partitions were already created so I need to use that because I can't change that. – Saharsh Shah Jan 07 '15 at 12:57
  • @SaharshShah I have updated my answer and added an "option 3" that may work for you since it does not use an `OR` in the where clause. – BateTech Jan 07 '15 at 14:26
  • @SaharshShah, when you do `EXPLAIN PARTITIONS` on the query with the `PartitionsID IN(hard coded list)` operator, the second query in your question, do you find that it uses your partitions correctly? Please clarify. – O. Jones Jan 08 '15 at 15:04

4 Answers

There are a few options that I can think of.

  1. Create case statements that cover multi-year search criteria.
  2. Create a CalendarDays table and use it to get the distinct list of DayOfYear for your in clause.
  3. Variation of option 1 but using a union to search each range separately

Option 1: Using case statements. It is not pretty, but seems to work. There is a scenario where this option could search one extra partition, 366, if the query spans years in a non-leap year. Also, I'm not certain that the optimizer will like the OR in the RS.PartitionsID filter, but you can try it out.

SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
FROM Report_Summary RS
WHERE RS.ReportTime >= @startDate AND RS.ReportTime <= @endDate
    AND 
    (
    RS.PartitionsID BETWEEN 
        CASE 
            WHEN
                -- more than one year, search all days
                year(@endDate) - year(@startDate) > 1
                -- one full year difference
                OR year(@endDate) - year(@startDate) = 1 
                    AND DAYOFYEAR(@startDate) <= DAYOFYEAR(@endDate)
            THEN 1
            ELSE DAYOFYEAR(@startDate)
        END
        and 
        CASE
            WHEN 
                -- query spans the end of a year
                year(@endDate) - year(@startDate) >= 1
            THEN 366
            ELSE DAYOFYEAR(@endDate)
        END
    -- Additional condition to search the "less than" portion of the next year
    OR RS.PartitionsID <=
        CASE
            WHEN year(@endDate) - year(@startDate) > 1
                OR DAYOFYEAR(@startDate) > DAYOFYEAR(@endDate)
            THEN DAYOFYEAR(@endDate)
            ELSE NULL
        END
    )
GROUP BY ReportDate;

Option 2: Using CalendarDays table. This option is much cleaner. The downside is you will need to create a new CalendarDays table if you do not have one.

SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
FROM Report_Summary RS
WHERE RS.ReportTime >= @startDate AND RS.ReportTime <= @endDate
    AND RS.PartitionsID IN
    (
        SELECT DISTINCT DAYOFYEAR(c.calDate) 
        FROM calendarDays c
        WHERE c.calDate >= @startDate and c.calDate <= @endDate
    )
GROUP BY ReportDate;
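
A minimal sketch of how such a table could be created and filled on MySQL 5.6, which has no recursive queries. The names calendarDays and calDate follow the query above; the helper table digits, the start date, and the number of days are assumptions chosen only for illustration. Whether the optimizer actually prunes partitions with this subquery is not guaranteed and should be checked with EXPLAIN PARTITIONS.

```sql
-- Hypothetical helper table holding the digits 0-9.
CREATE TABLE digits (n INT NOT NULL PRIMARY KEY);
INSERT INTO digits (n) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

CREATE TABLE calendarDays (
  calDate DATE NOT NULL,
  PRIMARY KEY (calDate)
) ENGINE=InnoDB;

-- Three digit positions give offsets 0-999, i.e. 1000 consecutive days
-- starting at an arbitrary first date.
INSERT INTO calendarDays (calDate)
SELECT DATE('2014-01-01') + INTERVAL (a.n + 10 * b.n + 100 * c.n) DAY
  FROM digits a
  CROSS JOIN digits b
  CROSS JOIN digits c;
```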

EDIT: Option 3: variation of option 1, but using Union All to search each range separately. The idea here is that since there is not an OR in the statement, that the optimizer will be able to apply the partition pruning. Note: I do not normally work in MySQL, so my syntax may be a little off, but the general idea is there.

-- MySQL user variables need no DECLARE; the range bounds set below are day-of-year integers.
SET @startDate := '2014-12-26 00:00:00', @endDate := '2015-01-01 23:59:59';

SELECT @rangeOneStart := 
        CASE 
            WHEN
                -- more than one year, search all days
                year(@endDate) - year(@startDate) > 1
                -- one full year difference
                OR year(@endDate) - year(@startDate) = 1 
                    AND DAYOFYEAR(@startDate) <= DAYOFYEAR(@endDate)
            THEN 1
            ELSE DAYOFYEAR(@startDate)
        END
    , @rangeOneEnd := 
        CASE
            WHEN 
                -- query spans the end of a year
                year(@endDate) - year(@startDate) >= 1
            THEN 366
            ELSE DAYOFYEAR(@endDate)
        END 
    , @rangeTwoStart := 1
    , @rangeTwoEnd := 
        CASE
            WHEN year(@endDate) - year(@startDate) > 1
                OR DAYOFYEAR(@startDate) > DAYOFYEAR(@endDate)
            THEN DAYOFYEAR(@endDate)
            ELSE NULL
        END
;

SELECT t.ReportDate, sum(t.Amount) as Total
FROM 
(
    SELECT DATE(RS.ReportTime) AS ReportDate, RS.Amount
    FROM Report_Summary RS
    WHERE RS.PartitionsID BETWEEN @rangeOneStart AND @rangeOneEnd
        AND RS.ReportTime >= @startDate AND RS.ReportTime <= @endDate

    UNION ALL

    SELECT DATE(RS.ReportTime) AS ReportDate, RS.Amount
    FROM Report_Summary RS
    WHERE RS.PartitionsID BETWEEN @rangeTwoStart AND @rangeTwoEnd
        AND @rangeTwoEnd IS NOT NULL
        AND RS.ReportTime >= @startDate AND RS.ReportTime <= @endDate
) t
GROUP BY ReportDate;
BateTech
  • Thanks for your valuable response, but I had already tried the CASE statement and it also does not use the proper partitions and indexes to fetch the data. The second option does not use the proper partitions and indexes either. – Saharsh Shah Jan 06 '15 at 05:38
  • What if you add an index hint in your FROM clause like `FROM Report_Summary RS USE KEY (UNIQUE) ` , and also move so that the PartitionsID is the first statement in the where clause and then try option 2 again? http://dev.mysql.com/doc/refman/5.7/en/index-hints.html – BateTech Jan 06 '15 at 11:55
  • That will make the query use the index, but it won't change the use of partitions. It will still use all partitions, and I want my query to use only the specific partitions. So an index hint won't work in my case. – Saharsh Shah Jan 06 '15 at 12:19
  • I just updated this answer and added option 3 which may work for you. – BateTech Jan 07 '15 at 13:36
  • I am quite sure that only the CalendarDays solution will bring a reasonable execution plan and execution time. Such a table is also needed when you might have Zero-Days, so days without errors or subscriptions, and want to show also those 'gaps'. – flaschenpost Jan 12 '15 at 08:13

To start addressing this problem, you need a subquery to, given a date range, return a resultset consisting of all the DAYOFYEAR() values in that range.

Let us work that out. For starters, we need a query that can return a sequence of all the integers from 0 to at least 366. Here is that query. It gives back a column of seq values 0-624.

SELECT A.N + 5*(B.N + 5*(C.N + 5*(D.N))) AS seq
  FROM (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2 
                      UNION SELECT 3 UNION SELECT 4) AS A
  JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                      UNION SELECT 3 UNION SELECT 4) AS B
  JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                      UNION SELECT 3 UNION SELECT 4) AS C
  JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                      UNION SELECT 3 UNION SELECT 4) AS D

(This is simple cross-join trickery to generate all combinations of 5**4 numbers.)

Next, we need to use this to generate the list of DAYOFYEAR() values. Let's use your starting and ending dates for the sake of the example. This query generates a result set containing a bunch of integers showing the days of the year in that date range.

SELECT DISTINCT DAYOFYEAR(first_day + INTERVAL seq DAY) doy
  FROM (SELECT DATE('2014-12-26 00:00:00') AS first_day,
               DATE('2015-01-01 23:59:59') AS last_day
       ) params
  JOIN (
         SELECT A.N + 5*(B.N + 5*(C.N + 5*(D.N))) AS seq
           FROM (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2 
                               UNION SELECT 3 UNION SELECT 4) AS A
           JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                               UNION SELECT 3 UNION SELECT 4) AS B
           JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                               UNION SELECT 3 UNION SELECT 4) AS C
           JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                               UNION SELECT 3 UNION SELECT 4) AS D
       ) seq ON seq.seq <= TIMESTAMPDIFF(DAY,first_day,last_day)
 ORDER BY 1

I think you can convince yourself that this gnarly little query works correctly for any reasonable range of days spanning about a year and a half (625 days) or less. If you use longer spans of time you may mess up leap years.

Finally, you can use this query in your PartitionsID IN () clause. That would look like this.

SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
  FROM Report_Summary RS
 WHERE RS.ReportTime >= '2014-12-26 00:00:00'
   AND RS.ReportTime <= '2015-01-01 23:59:59'
   AND RS.PartitionsID 
     IN (
         SELECT DISTINCT DAYOFYEAR(first_day + INTERVAL seq DAY) doy
           FROM (SELECT DATE('2014-12-26 00:00:00') AS first_day,
                        DATE('2015-01-01 23:59:59') AS last_day
                ) params
           JOIN (
                  SELECT A.N + 5*(B.N + 5*(C.N + 5*(D.N))) AS seq
                    FROM (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2 
                                        UNION SELECT 3 UNION SELECT 4) AS A
                    JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                                        UNION SELECT 3 UNION SELECT 4) AS B
                    JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                                        UNION SELECT 3 UNION SELECT 4) AS C
                    JOIN (SELECT 0 AS N UNION SELECT 1 UNION SELECT 2
                                        UNION SELECT 3 UNION SELECT 4) AS D
                ) seq ON seq.seq <= TIMESTAMPDIFF(DAY,first_day,last_day)
          ORDER BY 1
         ) 
GROUP BY ReportDate; 

That should do it for you.

If you are using MariaDB 10+, there are built in sequence tables named like seq_0_to_624.
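
As a rough sketch, assuming MariaDB with its Sequence storage engine available, the day-of-year subquery above could then shrink to something like the following. The dates are the ones from the question; this does not apply to stock MySQL 5.6.

```sql
-- seq_0_to_624 is a virtual table with a single column named seq.
SELECT DISTINCT DAYOFYEAR(DATE('2014-12-26') + INTERVAL seq DAY) AS doy
  FROM seq_0_to_624
 WHERE seq <= TIMESTAMPDIFF(DAY, DATE('2014-12-26'), DATE('2015-01-01'));
```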

There's a writeup on this topic here:

http://www.plumislandmedia.net/mysql/filling-missing-data-sequences-cardinal-integers/

O. Jones
  • I had tested this but the query uses all partitions instead of specific partitions. And also I had tried before by creating a table with 366 entries of numbers and tried to JOIN table with the query, but still I got failure. – Saharsh Shah Jan 09 '15 at 06:40
  • I suggest you run a before-and-after performance test on an *actual query*, not just `EXPLAIN`, leaving out or putting in the `AND RS.PartitionsID IN (...)` part of the statement. It may be that you are still getting a performance benefit from the attempt to list the partitions needed, even if it doesn't show up in `EXPLAIN`. Of course, it may be that you have the `UNION ALL` of a lot of one-day queries in your near future. – O. Jones Jan 09 '15 at 12:23

I found a solution by changing what I store in the PartitionsId column. Initially I stored DAYOFYEAR(reportTime) in it; now I store TO_DAYS(reportTime) instead.

So my table structure is as below:

CREATE TABLE `Report_Summary` (
  `PartitionsID` int(10) unsigned NOT NULL,
  `ReportTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `Amount` int(10) NOT NULL,
  UNIQUE KEY `UNIQUE` (`PartitionsID`,`ReportTime`),
  KEY `PartitionsID` (`PartitionsID`),
  KEY `ReportTime` (`ReportTime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED
/*!50100 PARTITION BY HASH (PartitionsID)
PARTITIONS 366 */

INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735958,'2014-12-26 11:46:12',100);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735959,'2014-12-27 11:46:23',50);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735960,'2014-12-28 11:46:37',44);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735961,'2014-12-29 11:46:49',15);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735962,'2014-12-30 11:46:59',56);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735963,'2014-12-31 11:47:22',68);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735964,'2015-01-01 11:47:35',76);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735965,'2015-01-02 11:47:43',88);
INSERT INTO `Report_Summary` (`PartitionsID`, `ReportTime`, `Amount`) VALUES(735966,'2015-01-03 11:47:59',77);

Check the SQL FIDDLE DEMO:

My query is:

EXPLAIN PARTITIONS 
SELECT DATE(RS.ReportTime) AS ReportDate, SUM(RS.Amount) AS Total
FROM Report_Summary RS
WHERE RS.ReportTime >= '2014-12-26 00:00:00' AND RS.ReportTime <= '2015-01-01 23:59:59' AND 
      RS.PartitionsID BETWEEN TO_DAYS('2014-12-26 00:00:00') AND TO_DAYS('2015-01-01 23:59:59')
GROUP BY ReportDate; 

The above query scans only the partitions I need and also uses the proper index. So changing the logic of the PartitionsId column gave me a working solution.
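
As a small sketch of what happens underneath: with PARTITION BY HASH and PARTITIONS 366, MySQL places each row in partition MOD(PartitionsID, 366), so the stored TO_DAYS value still maps onto one of 366 partitions. The aliases and the derived table below are illustrative only.

```sql
-- Show the stored id and the partition number it hashes to for two dates.
SELECT d                    AS ReportDate,
       TO_DAYS(d)           AS PartitionsID,
       MOD(TO_DAYS(d), 366) AS partition_no
  FROM (SELECT DATE('2014-12-26') AS d
        UNION ALL
        SELECT DATE('2015-01-01')) AS dates;
```

Since the partition number wraps back to 0 every 366 days, a date range that crosses such a wrap should be checked separately with EXPLAIN PARTITIONS.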

Thanks for all the replies, and many thanks for everyone's time.

Saharsh Shah
  • Be careful: When you run a longer time, you will get many partitions, since every new day creates one. I would definitely suggest a persistent calendar table with one line for every day and the correct partition number, from where you can select the `where in` part. – flaschenpost Jan 12 '15 at 08:10
  • With only 366 partitions, you will likely run into the same problem you originally had, only the break between PartitionsID 366 and 1 will be somewhere other than Dec 31 - Jan 1. – BateTech Jan 12 '15 at 11:33
  • My point is that since you're using HASH partitioning, your partition # is generated using the formula `MOD(TO_DAYS(ReportTime), 366)` (http://dev.mysql.com/doc/refman/5.7/en/partitioning-hash.html), so now your partition # "reset" from 365 back to 0 will occur around 2015-03-04 instead of 2014-12-31. So your SQLFiddle does not prove a fix for the original issue because it does not cover a date range that spans this partition # "reset". MySQL may handle this "reset" when using `TO_DAYS` much better than using `DAYOFYEAR` since the `TO_DAYS` fn is linear but your example does not show this. – BateTech Jan 12 '15 at 13:38
  • Abandon `BY HASH`. @BateTech explains why. `BY RANGE` would be better, but still not as good as abandoning PARTITIONing and simply having `PRIMARY KEY(ReportTime)`. The goal of partitioning is to cut down on I/O needed to perform a task. No partitioning solution discussed so far does any better than this non-PARTITION solution. – Rick James Mar 06 '15 at 20:21

Based on your SELECT, what you really need is a Data Warehousing technique called "Summary Tables". With such, you summarize the data each day (or hour or whatever) and store the subtotals in a much smaller table. Then the "report" looks at that table and totals up the subtotals. This is often 10x faster than the brute force scan of the raw data. More details: http://mysql.rjweb.org/doc.php/datawarehouse .
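
A minimal sketch of that idea, assuming a hypothetical summary table named Report_Daily that is refreshed once per day by some scheduled job. The table name, its columns, and the refresh schedule are illustrative and not taken from the question.

```sql
-- One row per day holding the precomputed subtotal.
CREATE TABLE Report_Daily (
  ReportDate DATE   NOT NULL,
  Total      BIGINT NOT NULL,
  PRIMARY KEY (ReportDate)
) ENGINE=InnoDB;

-- Daily refresh: summarize yesterday's raw rows; re-running it is harmless.
INSERT INTO Report_Daily (ReportDate, Total)
SELECT DATE(ReportTime), SUM(Amount)
  FROM Report_Summary
 WHERE ReportTime >= CURDATE() - INTERVAL 1 DAY
   AND ReportTime <  CURDATE()
 GROUP BY DATE(ReportTime)
ON DUPLICATE KEY UPDATE Total = VALUES(Total);

-- The report then reads only the small table.
SELECT ReportDate, Total
  FROM Report_Daily
 WHERE ReportDate BETWEEN '2014-12-26' AND '2015-01-01';
```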

Doing that eliminates the need for PARTITIONing in either the raw data ("Fact table") or the summary table.

However, if you need to purge the old data, then PARTITIONing can come in handy because of DROP PARTITION. For this you would use BY RANGE(TO_DAYS(...)), not BY HASH.
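
A rough sketch of such a layout, under a few assumptions: the table name Report_Summary_ByRange and the monthly partition names are hypothetical, and ReportTime is declared as DATETIME because MySQL does not accept TO_DAYS() on a TIMESTAMP column in a partitioning expression.

```sql
CREATE TABLE Report_Summary_ByRange (
  ReportTime DATETIME NOT NULL,
  Amount     INT      NOT NULL,
  PRIMARY KEY (ReportTime)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(ReportTime)) (
  PARTITION p2014_12 VALUES LESS THAN (TO_DAYS('2015-01-01')),
  PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
  PARTITION pFuture  VALUES LESS THAN MAXVALUE
);

-- Purging a whole month is a quick metadata operation instead of a large DELETE.
ALTER TABLE Report_Summary_ByRange DROP PARTITION p2014_12;
```

With this layout a plain range condition on ReportTime is enough for pruning, so no separate PartitionsID column is needed.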

Rick James