2

Although several . questions . come . close . to what I want (and as I write this stackoverflow has suggested several more, none of which quite capture my problem), I just don't seem to be able to find my way out of the SQL thicket.

I have a single table (let's call it the user_classification_fct) that has three fields: user, week, and class (e.g. user #1 in week #1 had a class of 'Regular User', while user #2 in week #1 has a class of 'Infrequent User'). (As an aside, I have implemented classes as INTs, but wanted to work with something legible in the form of VARCHAR while I sorted out the SQL.)

What I want to do is produce a summary report of how user behaviour is changing in aggregate along the lines of:

  1. There were 50 users who were regular users in both week 1 and week 2 and ...
  2. There were 10 users who were regular users in week 1, but fell to infrequent users in week 2
  3. There were 5 users who went from infrequent in week 1 to regular in week 2
  4. ... and so on ...

What makes this slightly more tricky is that user #5000 might only have started using the service in week 2 and so have no record in the table for week 1. In that case, I'd want to see a NULL FOR week 1 and a 'Regular User' (or whatever is appropriate) for week 2. The size of the table is not strictly relevant, but with 5 weeks' worth of data I'm looking at 42 million rows, so I do not want to insert 4 'fake' rows of 'Non-User' for someone who only starts using the service in week 5 or something.

To me this seems rather obviously like a case for using a LEFT or RIGHT JOIN in MySQL because the NULL should come through on the 'missing' record.

I have tried using both WHERE and AND conditions on the LEFT JOINs and am just not getting the 'right' answers (i.e. I either get no NULL values at all in the case of trailing WHERE conditions, or my counts are far, far too high for the number of distinct users (which is ca. 10 million) in the case of the AND constraints used below). Here's was my last attempt to get this working:

SELECT
    ucf1.class_nm AS 'Class in 2012/15',
    ucf2.class_nm AS 'Class in 2012/16',
    ucf3.class_nm AS 'Class in 2012/17',
    ucf4.class_nm AS 'Class in 2012/18',
    ucf5.class_nm AS 'Class in 2012/19',
    count(*) AS 'Count'
FROM
    user_classification_fct ucf5
LEFT JOIN user_classification_fct ucf4 
    ON ucf5.user_id=ucf4.user_id 
        AND ucf5.week_key=201219 AND ucf4.week_key=201218
LEFT JOIN user_classification_fct ucf3 
    ON ucf4.user_id=ucf3.user_id 
       AND ucf4.week_key=201218 AND ucf3.week_key=201217
LEFT JOIN user_classification_fct ucf2 
    ON ucf3.user_id=ucf2.user_id 
       AND ucf3.week_key=201217 AND ucf2.week_key=201216
LEFT JOIN user_classification_fct ucf1 
    ON ucf2.user_id=ucf1.user_id 
       AND ucf2.week_key=201216 AND ucf1.week_key=201215
GROUP BY 1,2,3,4,5;

In looking at the various other questions on stackoverflow.com, it may well be that I need to perform the queries one-at-a-time and UNION the result sets together or use parentheses to chain them one-to-another, but those approaches are not ones that I'm familiar with (yet) and I can't even get a single LEFT JOIN (i.e. week 5 to week 1, dropping all the other weeks of data) to return something useful.

Any tips would be much, much appreciated and I would really appreciate suggestions that work in MySQL as switching database products is not an option.

Community
  • 1
  • 1
  • It looks like you're trying to do a `PIVOT` in MySQL - if you haven't already, try searching for that. If that's not relevant, perhaps some sample data and an expected result set may help – podiluska Aug 15 '12 at 12:01
  • Hmmm, the pivot table approach looks about right, though it'll be 'interesting' to figure out how to convert the examples to my case. Will post (hopefully) functioning code after I've had a play. Thanks! – user1600439 Aug 15 '12 at 12:54

1 Answers1

1

You can do this with a group by. I would start by summarizing all the possible combinations for the five weeks as:

select c_201215, c_201216, c_201217, c_201218, c_201219,
       count(*) as cnt
from (select user_id,
             max(case when week_key=201215 then class_nm end) as c_201215,
             max(case when week_key=201216 then class_nm end) as c_201216,
             max(case when week_key=201217 then class_nm end) as c_201217,
             max(case when week_key=201218 then class_nm end) as c_201218,
             max(case when week_key=201219 then class_nm end) as c_201219
      from user_classification_fct  ucf
      group by user_id
     ) t
group by c_201215, c_201216, c_201217, c_201218, c_201219

This may solve your problem. If you have 5 classes (including NULL), then this will return at most 5^5 or 3,125 rows.

This fits into Excel, so you can do the final processing there. Alternatively, you can still use the database.

If you want to extract pairs of weeks, then I would suggest putting the above into a temporary table, say "t". And doing a series of extracts with unions:

select *
from ((select '201215' as weekstart, c_201215, c_201216, sum(cnt) as cnt
       from t
       group by c_201215, c_201216
      ) union all
      (select '201216', c_201216, c_201217, sum(cnt) as cnt
       from t
       group by c_201216, c_201217

      ) union all
      (select '201217', c_201217, c_201218, sum(cnt) as cnt
       from t
       group by c_201217, c_201218

      ) union all
      (select '201218', c_201218, c_201219, sum(cnt) as cnt
       from t
       group by c_201218, c_201219
      )
     ) tg
order by 1, cnt desc

I suggest putting it in a subquery because you don't want to message around with common-subquery optimizations on such a large table. You'll get to your final answer by summarizing first, and then bringing the data together.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Grrrr, can't seem to post the output from the first query, but that was exactly what I needed -- just a sense of how many people fall into each bucket as they accumulate 'history'. Performance isn't *stellar* (though, of course, I'm also trying to do something that isn't entirely db friendly to begin with) but is 100% acceptable for my purposes (19 minutes to return 457 rows from 42 million records). If I needed to continue doing this kind of report indefinitely then I'd eventually have to look for another solution, but this will work for the foreseeable future. Thank you! – user1600439 Aug 15 '12 at 14:25
  • Should also point out that I'd up-vote this if I could, but I apparently have a poor reputation in these parts. ;-) – user1600439 Aug 15 '12 at 14:59