0

Different variations of this question have been asked before, but none for the use case that I'm looking for. I'd like to find the specific number of weekdays between two dates for each row of a MySQL table and then update a column of each row with the result of that operation. This is part of an ETL process, and I'd like to keep this in a stored procedure if at all possible.

Data

Dates are of DATE type and I'd like to find the number of a specific because I have 7 day columns that have a flag if a record occurs on that day of the week. Like this (1 is Monday):

day_1 | day_2 | day_3 | day_4 | day_5 | day_6 | day_7
----- | ----- | ----- | ----- | ----- | ----- | -----
0     | 1     | 0     | 1     | 1     | 0     | 1

Example Use Case

I'm doing this because I'm trying to find the frequency of rows for a timeframe that's not available in the input data (call it input). So for a record that had start and end date values of 2016-01-01 and 2016-03-01, I'd want to know how often that record would have occurred only from 2016-01-01 to 2016-01-31, inclusive. I initially tried to do this by making a table that contained all datevalues for many years into the future like:

datevalue
---------
2016-01-01
2016-01-02
...

and then joining input to that table on start_date and end_date and then aggregating up while counting the number of each day like this:

SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 1 THEN 1 ELSE 0 END) * day_1 +
        SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 2 THEN 1 ELSE 0 END) * day_2 +
        SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 3 THEN 1 ELSE 0 END) * day_3 +
        SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 4 THEN 1 ELSE 0 END) * day_4 +
        SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 5 THEN 1 ELSE 0 END) * day_5 +
        SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 6 THEN 1 ELSE 0 END) * day_6 +
        SUM(CASE WHEN WEEKDAY(B.datevalue) + 1 = 7 THEN 1 ELSE 0 END) * day_7 AS adj_total_frequency

That worked perfectly on a smaller dataset, but input has > 30 million records, and when I tried running on that procedure it ran for 36 hours before I killed it.

Is there a more efficient way of doing this in MySQL?

gr1zzly be4r
  • 2,072
  • 1
  • 18
  • 33
  • You could precalculate `WEEKDAY(B.datevalue) + 1` and store it in the table with `datevalue`; not sure how much it will speed it up though. The alternative is not datesubbing the `end` and `start` and figuring out the calculation to remove the appropriate number of days from that. – Uueerdo Aug 12 '16 at 16:45
  • Precalculating `WEEKDAY() + 1` speeds up the query that aggregates just the `datevalue` column in the `dates` table by 25%, but I don't know if that improvement scales linearly. What do you mean not datesubbing? – gr1zzly be4r Aug 12 '16 at 17:00
  • Oh, I'm not sure how the "not" got in there. I meant the alternative is to datesub. – Uueerdo Aug 12 '16 at 17:01

2 Answers2

1

Too long for a comment but, combining with the pre-calculation of weekday I originally suggested, how much does this (using a single SUM with a complete CASE) work out for you?

SUM(CASE WHEN B.weekdayval = 1 AND day_1 THEN 1
        WHEN B.weekdayval = 2 AND day_2 THEN 1
        WHEN B.weekdayval = 3 AND day_3 THEN 1
        WHEN B.weekdayval = 4 AND day_4 THEN 1
        WHEN B.weekdayval = 5 AND day_5 THEN 1
        WHEN B.weekdayval = 6 AND day_6 THEN 1
        WHEN B.weekdayval = 7 AND day_7 THEN 1 
        ELSE 0 END) AS adj_total_frequency

actually this could be better; it could theoretically mean B.weekdayval only gets compared once per row (I say theoretically because MySQL does not guarantee irrelevant THEN clauses will not be evaluated, just not "returned" from the CASE).

SUM(CASE WHEN day_1 THEN B.weekdayval = 1
        WHEN day_2 THEN B.weekdayval = 2
        WHEN day_3 THEN B.weekdayval = 3
        WHEN day_4 THEN B.weekdayval = 4
        WHEN day_5 THEN B.weekdayval = 5
        WHEN day_6 THEN B.weekdayval = 6
        WHEN day_7 THEN B.weekdayval = 7 
        ELSE 0 END) AS adj_total_frequency

Edit: As far as the datesub method goes, I don't have the time to write a full solution, but to start you (or other potential answerers) on that...

  • I meant DATEDIFF
  • you can get the number of whole weeks between the start and end with DATEDIFF(end, start) DIV 7
  • multiply that by the number of days in a week that apply to get an approximation
  • then (the hardest part), figure out the number of days to add for the fractional week not covered by div.
Uueerdo
  • 15,723
  • 1
  • 16
  • 21
  • Wouldn't the `DATEDIFF` method not be accurate throughout the year, though? It's going to struggle when months have differeing numbers of days in them. – gr1zzly be4r Aug 12 '16 at 18:13
  • `DATEDIFF` gives the number of days between, the variability of months shouldn't factor into this calculation; it would if months reset the week prematurely, but that is not how our calendar works. To state the obvious for clarity, months start on the day of the week after the day of the week the previous month ended on. – Uueerdo Aug 12 '16 at 18:18
  • I'm accepting this answer because it's closer to the solution that I ended up using, which was using a function from [this answer here](http://stackoverflow.com/a/27377521/3991782). – gr1zzly be4r Aug 16 '16 at 20:19
1

(Sometimes) MySQL has big troubles optimizing GROUP BY statements with a JOIN. To overcome that you can store the joined result into a temporary table so you can use GROUP BY with one table.

drop temporary table if exists tmp;
create temporary table tmp (id int unsigned not null)
engine=myisam
    select i.id
    from input i
    straight_join dates B 
        on  B.datevalue >= i.`start`
        and B.datevalue <  i.`end`
    where (
        (WEEKDAY(B.datevalue ) = 0) AND i.day_7 OR
        (WEEKDAY(B.datevalue ) = 1) AND i.day_1 OR
        (WEEKDAY(B.datevalue ) = 2) AND i.day_2 OR
        (WEEKDAY(B.datevalue ) = 3) AND i.day_3 OR
        (WEEKDAY(B.datevalue ) = 4) AND i.day_4 OR
        (WEEKDAY(B.datevalue ) = 5) AND i.day_5 OR
        (WEEKDAY(B.datevalue ) = 6) AND i.day_6
    )
    -- and i.id >  000000
    -- and i.id <= 100000
;

drop temporary table if exists tmp1;
create temporary table tmp1 (id int unsigned not null, cnt int unsigned not null)
engine=myisam
    select id, count(1) as cnt
    from tmp
    group by id
;

update input i
join tmp1 using(id)
set i.numdays = tmp1.cnt
where 1=1;

My test data contains 1M rows with random day bits (round(rand())) and an average date range of 50 days. So the tmp table contains about 25M rows.

On my system it takes about 500 msec for 10K rows, 5 sec for 100K rows and 2 mins for 1M rows. So if you split the updates in chunks of 100K rows (using the commented id range condition in the first statement) you should be ready in about 30 minutes.

Paul Spiegel
  • 30,925
  • 5
  • 44
  • 53
  • Thank you for the answer, but I ended up using an answer that was more similar to the other answer that was given for this question, ([this answer here](http://stackoverflow.com/a/27377521/3991782).) – gr1zzly be4r Aug 16 '16 at 20:20