2

I am working with MySql and Symfony2. I need to build cohort analysis table. I need to compare how many users in each cohort log in to website at least once a week after they register. What I tried to do is to get number of registered users by week, basically these are my cohorts.

 SELECT DATE_FORMAT(date_added,'%d %b %y') as reg_date, COUNT(*) AS user_count 
 FROM user 
 WHERE date_added>='2016-02-01' AND date_added<=NOW() 
 GROUP BY WEEK(date_added)

This query gets distinct users logged in to website by week.

 SELECT WEEK(login_date) AS week, COUNT(DISTINCT user_id) AS user_count
 FROM user_log
 WHERE login_date>='2016-02-01' AND login_date<=NOW()
 GROUP BY WEEK(login_date)

My problem: I can't figure out how to group logged in users by cohorts and compare cohorts by weeks. I hope I stated problem clearly. English is not my first language. Thanks.

Sample data:

 user table 

 id      | date_added (in WEEK() format) 
 A       | 1  
 B       | 1  
 C       | 1   
 D       | 2
 E       | 2
 F       | 2
 G       | 2
 ------------

 user_log table 
 user_id | login_date (in WEEK() format)
 A       | 1   
 B       | 1
 B       | 1
 A       | 2
 D       | 2
 A       | 2   
 D       | 2
 E       | 2

Expected table. Cohort 1 - users registered in week 1, cohort 2- in week etc. Size - number of registered users. Week 1 - how many users logged back to website in a first week after registration, Week 2 - how many users logged back to website in a second week after registration

 Cohort     size    Week1   Week2
 Cohort 1 |  3   |    2   |    1   |    
 Cohort 2 |  4   |    2   |    -   |
  • Sample data would go a long way towards your getting a quick answer. – Tim Biegeleisen Feb 09 '16 at 10:41
  • @TimBiegeleisen I updated my question. Thanks for advice –  Feb 09 '16 at 11:01
  • Can there be any arbitrary number of weeks? This looks like it's going to take a pivot query (among other things) to solve your problem. A pivot with variable number of columns is non-trivial in MySQL. – Tim Biegeleisen Feb 09 '16 at 11:05
  • 1
    @TimBiegeleisen I don't really understand what do you mean by arbitrary? Week number here is just an example. In real result I get week number by converting data with this format: WEEK('2016-02-01') will give 5 (because it's 5th week in calendar, week start at 00:00:00 on Monday), and so on. –  Feb 09 '16 at 11:13
  • Database queries are meant to grew row-wise automatically, but creating a dynamic number of columns in a result set might require dynamic SQL (programming). That being said, is the number of weeks _fixed_, or does the query have to tolerate an unknown number of weeks based on the data? – Tim Biegeleisen Feb 09 '16 at 11:14
  • 1
    @TimBiegeleisen for now I only want to get data for this period of time: login_date>='2016-02-01' AND login_date<=NOW(). i.e number of week fixed - 2 weeks –  Feb 09 '16 at 11:37

1 Answers1

1

This is borrowed from my modification of @Andriy M's answer of this question: Cohort analysis in SQL

This query gets unique user logins by week after registering.

SELECT DISTINCT
    user_id,
    FLOOR(DATEDIFF(user_log.login_date, user.date_added)/7) AS Offset
    FROM user_log
    LEFT JOIN user ON (user.id = user_log.user_id)
    WHERE user_log.login_date >= CURDATE() - INTERVAL 14 DAY

This query gets all the users created in the past 14 days and formats the date to the week they signed up:

 SELECT
    id,
    DATE_FORMAT(date_added, "%Y-%u") AS cohort
  FROM user
  WHERE date_added >= CURDATE() - INTERVAL 14 DAY

We can put those two queries together to get a table with how many people came back after registering:

SELECT STR_TO_DATE(CONCAT(u.cohort, ' Monday'), '%X-%V %W') as date,
  SUM(s.Offset = 0) AS size,
  SUM(s.Offset = 1) AS Week1,
  SUM(s.Offset = 2) AS Week2
FROM (
 SELECT
    id,
    DATE_FORMAT(date_added, "%Y-%u") AS cohort
  FROM user
  WHERE date_added >= CURDATE() - INTERVAL 21 DAY
) as u
LEFT JOIN (
    SELECT DISTINCT
    user_id,
    FLOOR(DATEDIFF(user_log.login_date, user.date_added)/7) AS Offset
    FROM user_log
    LEFT JOIN user ON (user.id = user_log.user_id)
    WHERE user_log.login_date >= CURDATE() - INTERVAL 21 DAY
) as s
ON s.user_id = u.id
GROUP BY u.cohort
ORDER BY u.cohort

Since we aren't counting how many people registered in a given week, we are assuming that they logged at lease once in the week they registered to give an accurate result for the size column.

Also you'll have to rework this to get a number for the cohort instead of the date, but I find dates more helpful.

Also you can extend this to more weeks - you'll have to change the number of days after INTERVAL in both subqueries, and you can add more rows on in the main select statement to get more weeks.

Community
  • 1
  • 1
CovertIII
  • 1,053
  • 1
  • 11
  • 18