1

I have the following set of MySQL queries, used to track user progress through a website. Is there a good way to simplify them?

#How many people reached stage 2
SELECT COUNT(DISTINCT a.session_id) as "total"
    FROM formation_page_hits a
    WHERE a.progress = 2
    AND DATE(a.datetime) = "2011-03-23";

#How many people reached stage 4 having reached stage 2
SELECT COUNT(DISTINCT a.session_id) as "total"
    FROM formation_page_hits a, (SELECT f.session_id, f.`datetime`
                        FROM formation_page_hits f
                        WHERE f.progress = 2) as b
    WHERE a.progress = 4
    AND a.session_id = b.session_id
    AND DATE(b.datetime) = "2011-03-23"
    AND DATE(a.datetime) = "2011-03-23";


#How many people reached stage 7, having reached stage 4, having reached stage 2
SELECT COUNT(DISTINCT a.session_id) as "total"
    FROM formation_page_hits a, (SELECT f.session_id, f.`datetime`
                        FROM formation_page_hits f
                        WHERE f.progress = 4) as b, (SELECT f.session_id, f.`datetime`
                        FROM formation_page_hits f
                        WHERE f.progress = 2) as c
    WHERE a.progress = 7
    AND a.session_id = b.session_id
    AND a.session_id = c.session_id
    AND DATE(c.datetime) = "2011-03-23"
    AND DATE(b.datetime) = "2011-03-23"
    AND DATE(a.datetime) = "2011-03-23";

As you can see, I'm very quickly re-querying the same information and there are an additional 4 or 5 queries that follow the same pattern - is there a better way of constructing the query that means I don't have to keep querying for "how many people reached stage 2"?

EDIT: each page view is stored as an entry in formation_page_hits - so that there is a complete record of page views for each session

id_formation_page_hits INT PRIMARY_KEY, session_id VARCHAR(100), datetime DATETIME, progress INT
Jon
  • 12,684
  • 4
  • 31
  • 44
  • Can we cache the results in the form of temp table? – Nitin Midha Mar 28 '11 at 16:04
  • Can a `session_id` have more than record for any `progress`? – Quassnoi Mar 28 '11 at 16:06
  • Yes, each `session_id` has multiple entries (one for each page view within that session) - each page has a specific `progress` number attached to it – Jon Mar 28 '11 at 16:10
  • I mean, can there be two entries for, say `(session_id = 1, progress = 2)`? – Quassnoi Mar 28 '11 at 16:22
  • @Quassnoi yes there can be, but only one of those entries should be counted - it doesnt matter, in this case, whether there is more than one entry per `session_id` for any particular value of `progress`, it just matters that there is **at least** 1 entry for each of the values of `progress` specified in the list – Jon Mar 28 '11 at 16:34

1 Answers1

1
SELECT  COUNT(*)
FROM    (
        SELECT  session_id
        FROM    formation_page_hits
        WHERE   progress IN (2, 4, 7)
                AND datetime >= '2011-03-23'
                AND datetime < '2011-03-24'
        GROUP BY
                session_id
        HAVING  COUNT(DISTINCT progress) = 3
        ) q

Create a composite index on (session_id, datetime, progress) for this to work fast.

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • Returns 0 rows - there will be a variable number of records for each `session_id`, covering more than 3 unique `progress` values. What purpose does the q have at the end of your query? – Jon Mar 28 '11 at 16:21
  • @Jon: corrected a typo, but this should not really mattter. You should adjust the constant to match the number of items in the list. `q` is subquery alias, required by the parser. Also, this query cannot return `0` rows, though it may return `1` row with `0` in it. Is it what happens? – Quassnoi Mar 28 '11 at 16:25
  • My bad - not sure what typo you corrected but its working A-OK now, after I added in the DATE() function around `datetime`. Thanks very much! – Jon Mar 28 '11 at 16:33
  • @Jon: instead of using `DATE`, use the syntax in my update. It's more index-friendly. – Quassnoi Mar 28 '11 at 16:36
  • @Quassnoi is there any way to force the order of the appearance of the values in the `progress` list, so that an entry with `progress=2` has to be found before `progress=4` and before `progress=7` – Jon Mar 29 '11 at 08:14
  • @Jon: sure, please post it as a separate question. – Quassnoi Mar 29 '11 at 09:07
  • @Quassnoi thanks - http://stackoverflow.com/questions/5471399/mysql-force-order-of-results-found-to-match-order-of-in-clause – Jon Mar 29 '11 at 10:57