Although several . questions . come . close . to what I want (and as I write this stackoverflow has suggested several more, none of which quite capture my problem), I just don't seem to be able to find my way out of the SQL thicket.
I have a single table (let's call it the user_classification_fct) that has three fields: user, week, and class (e.g. user #1 in week #1 had a class of 'Regular User', while user #2 in week #1 has a class of 'Infrequent User'). (As an aside, I have implemented classes as INTs, but wanted to work with something legible in the form of VARCHAR while I sorted out the SQL.)
What I want to do is produce a summary report of how user behaviour is changing in aggregate along the lines of:
- There were 50 users who were regular users in both week 1 and week 2 and ...
- There were 10 users who were regular users in week 1, but fell to infrequent users in week 2
- There were 5 users who went from infrequent in week 1 to regular in week 2
- ... and so on ...
What makes this slightly more tricky is that user #5000 might only have started using the service in week 2 and so have no record in the table for week 1. In that case, I'd want to see a NULL FOR week 1 and a 'Regular User' (or whatever is appropriate) for week 2. The size of the table is not strictly relevant, but with 5 weeks' worth of data I'm looking at 42 million rows, so I do not want to insert 4 'fake' rows of 'Non-User' for someone who only starts using the service in week 5 or something.
To me this seems rather obviously like a case for using a LEFT or RIGHT JOIN in MySQL because the NULL should come through on the 'missing' record.
I have tried using both WHERE and AND conditions on the LEFT JOINs and am just not getting the 'right' answers (i.e. I either get no NULL values at all in the case of trailing WHERE conditions, or my counts are far, far too high for the number of distinct users (which is ca. 10 million) in the case of the AND constraints used below). Here's was my last attempt to get this working:
SELECT
ucf1.class_nm AS 'Class in 2012/15',
ucf2.class_nm AS 'Class in 2012/16',
ucf3.class_nm AS 'Class in 2012/17',
ucf4.class_nm AS 'Class in 2012/18',
ucf5.class_nm AS 'Class in 2012/19',
count(*) AS 'Count'
FROM
user_classification_fct ucf5
LEFT JOIN user_classification_fct ucf4
ON ucf5.user_id=ucf4.user_id
AND ucf5.week_key=201219 AND ucf4.week_key=201218
LEFT JOIN user_classification_fct ucf3
ON ucf4.user_id=ucf3.user_id
AND ucf4.week_key=201218 AND ucf3.week_key=201217
LEFT JOIN user_classification_fct ucf2
ON ucf3.user_id=ucf2.user_id
AND ucf3.week_key=201217 AND ucf2.week_key=201216
LEFT JOIN user_classification_fct ucf1
ON ucf2.user_id=ucf1.user_id
AND ucf2.week_key=201216 AND ucf1.week_key=201215
GROUP BY 1,2,3,4,5;
In looking at the various other questions on stackoverflow.com, it may well be that I need to perform the queries one-at-a-time and UNION the result sets together or use parentheses to chain them one-to-another, but those approaches are not ones that I'm familiar with (yet) and I can't even get a single LEFT JOIN (i.e. week 5 to week 1, dropping all the other weeks of data) to return something useful.
Any tips would be much, much appreciated and I would really appreciate suggestions that work in MySQL as switching database products is not an option.