-1

Suppose you have a table like this:

table users:

  • userid (PK, id of the user)
  • regdate (datetime, date of registration)
  • idprovince (external id of address province)

If I write a query as such:

SELECT
    DATE_FORMAT(regdate,"%Y-%m-01") as regmonth,
    idprovince                                 ,
    count(userid) as num
FROM
    users
GROUP BY
    DATE_FORMAT(regdate,"%Y-%m-01"),
    idprovince

This will correctly produce a grouped result which will show how many NEW users have registered in any given month and province.

Suppose now I want the cumulative number of users in each province in any given month (the value for any given month and province should be the sum of the new users of that month and province AND all the previous months of that very same province), how should I build the query to be efficient?

I've tried using a subquery as such:

SELECT
    DATE_FORMAT(regdate,"%Y-%m-01") as regmonth,
    idprovince                                 ,
    (SELECT
        COUNT(userid)
    FROM
        users AS counting_0_tbl
    WHERE DATE_FORMAT(counting_0_tbl.regdate,"%Y-%m-01")<=DATE_FORMAT(users.regdate,"%Y-%m-01")
        AND counting_0_tbl.idprovince                    = users.idprovince
    ) as num
FROM
    users
GROUP BY
    DATE_FORMAT(regdate,"%Y-%m-01"),
    idprovince

And IT WORKS PROPERLY, but takes AGES to run, something like 70+ seconds on a 70k lines table.

Any idea of how can I make it more efficient?

I'm increasingly thinking of sticking to the base query and do the cumulation in a second stage, in code...

I'm using Mysql 5.5, but if useful I can upgrade to MySQL 8.

Thanks for any help!

nbk
  • 45,398
  • 8
  • 30
  • 47
Shores
  • 95
  • 7

3 Answers3

0

With mysql 5.5 you use user-defined variables, to sum up number from different rows.

You have to keep the order of the column else the algorithm will not work

CREATE tABLE users (userid int,regdate date,idprovince int )
INSERT INTO users VALUEs (1,'2020-01-21',1),(2,'2020-02-21',1),(3,'2020-03-21',1),
(4,'2020-01-21',2),(5,'2020-02-21',2),(6,'2020-03-21',2)
    SELECT 
        regmonth, 
        IF(@idprovince = idprovince,@num:=@num + `num` , @num:= `num`)  as num,
        @idprovince := idprovince as idprovince
    FROM
        (SELECT 
            DATE_FORMAT(regdate, '%Y-%m-01') AS regmonth,
                idprovince,
                COUNT(userid) AS num
        FROM
            users
        GROUP BY DATE_FORMAT(regdate, '%Y-%m-01') , idprovince
        ORDER BY idprovince , DATE_FORMAT(regdate, '%Y-%m-01')) t1,(SELECT @num:=0,@idprovince := 0) t2
regmonth   | num | idprovince
:--------- | --: | ---------:
2020-01-01 |   1 |          1
2020-02-01 |   2 |          1
2020-03-01 |   3 |          1
2020-01-01 |   1 |          2
2020-02-01 |   2 |          2
2020-03-01 |   3 |          2

db<>fiddle here

nbk
  • 45,398
  • 8
  • 30
  • 47
  • Thanks! But I noticed in this moment, while testing your solution, which by the way does EXACTLY what my query does in a fraction of the time, that... Both of them do NOT work properly! The problem arises in these cases: suppose that in a month one user registered from province A, then the next month nobody registers from that province... Both our queries act wrong, since the do NOT insert a row in the second month repeating the running total from the previous month of province A... They both skip it! So, your answer is correct, but my question was wrong! ;) – Shores Feb 25 '21 at 01:39
  • Missing dates are possible in sql but a lot of work, when you do look for it here. . in mysql 8 it is simpler wit cte to build a table with all dates and left join it again the result table, so that you have month in a year like here https://stackoverflow.com/questions/6551179/mysql-find-missing-dates-between-a-date-range – nbk Feb 25 '21 at 08:51
0

Thanks to the input by @nbk , I managed to create this query, which is both fast and correct, and is based on the only assumption that there must be at least ONE user registered in each month; if that isn't the case, another way of generating the list of months should be researched.

SELECT
    regmonth      ,
    idprovince,
    num       ,
    cumnum
FROM
    (SELECT
        regmonth                                                                                 ,
        IF(@idprovince = idprovince,@cumnum:=@cumnum + `num` , @cumnum:= `num`) as cumnum    ,
        @idprovince := idprovince                                               as idprovince,
        num
    FROM
        ( select
            users2.regmonth                 ,
            users3.idprovince,
            coalesce(num,0) as num
        FROM
            (select
                date_format(regdate, "%Y-%m-01") as regmonth
            from
                users
            group by
                date_format(regdate, "%Y-%m-01")
            ) as users2
        CROSS JOIN provinces
            (select
                idprovince
            from
                users
            group by
                idprovince
            ) as users3
        LEFT JOIN
            (SELECT
                idprovince                                      ,
                DATE_FORMAT(users.regdate,"%Y-%m-01") as regmonth,
                count(id)                                as num
            from
                users
            GROUP BY
                idprovince,
                DATE_FORMAT(users.regdate,"%Y-%m-01")
            ) as users_totals on users_totals.idprovince=users3.idprovince AND user_totals.regmonth=users2.regmonth
        order by
            users3.idprovince,
            regmonth
        ) as t1      ,
        (SELECT @cumnum:=0,@idprovince := 0
        ) as t2
    ) as t3
ORDER BY
    regmonth,
    idprovince

In fact, the whole query is based on beginning with a CROSS JOIN (cartesian product) between all the months present as a regdate in the users table and all the province ids present in the users table. This makes sure that all the combinations of a month with an existing province id are represented.

Than we calculate the normal count in each of these groups, and join it to the cartesian product, adding a coalesce towards zero when the join fails.

Then the method proposed by @nbk is used to generate running totals, and finally an external query is put in place to restore the typical time based ordering (that has been changed to correctly sum cumulative totals).

And this finally works! :)

Shores
  • 95
  • 7
0

Build and maintain a "summary table" of daily subtotals. Update it each night, adding in just the new data for the previous day. Then, to get the "report", sum up the sums from that summary table. More discussion: http://mysql.rjweb.org/doc.php/summarytables

Rick James
  • 135,179
  • 13
  • 127
  • 222