0

I have quite a large DB, but I've simplified it for the purpose of this question:

A sample chunk of my DB

Basically, every time a user clicks something on my site, it gets logged as a row in my DB: the UserID field is a cookie that is used to identify the user, and Stuff and MoreStuff are data about the click. Note that, obviously, these are likely to be different every time, "Foo" and "Bar" is just a representation.

What I want to do is this: with an SQL query, filter out either all of the first visits, or all of the repeated visits (I assume if I can do one, I can invert my filter for the other). So, if I were to filter out all of the repeat visits on my sample, I'd get this:

The filtered DB

with green representing the selected rows and red representing the rejected ones.

How can I do this with just SQL?

Community
  • 1
  • 1
Bluefire
  • 13,519
  • 24
  • 74
  • 118

3 Answers3

3

You can do this with a group by and min(timestamp) table. Then put that in a view or temporary aliased table and join.

Example - get first visits timestamp:

    SELECT userID,
       min(TIMESTAMP)
FROM info
GROUP BY userID

Example - get all information:

    SELECT *
  FROM info t,    
  (SELECT userID,
          min(TIMESTAMP) as timestamp
   FROM info
   GROUP BY userID) AS t2
WHERE t.userID = t2.userID
  AND t.TIMESTAMP = t2.TIMESTAMP;

SQLFiddle: http://sqlfiddle.com/#!2/6fd238/3

Menelaos
  • 23,508
  • 18
  • 90
  • 155
2

If you want just the first visits:

select t.*
from table t
where not exists (select 1
                  from table t2
                  where t2.id = t.id and t2.timestamp < t.timestamp
                 );

This says: "Get me all rows from the table where there is no row with the same id and a smaller timestamp". This is equivalent to getting the smallest value for each id.

For performance, you want an index on table(id, timestamp).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

You could use aggregate function COUNT() and then HAVING statement like:

 SELECT userID, COUNT(userID)
 FROM tbl
 GROUP BY userID
 HAVING COUNT(userID) >= 2

You can then re-use the above query if you want to filter out who are the repeat visitors like:

SELECT * FROM tbl
WHERE EXISTS
(
 SELECT userID, COUNT(userID)
 FROM tbl
 GROUP BY userID
 HAVING COUNT(userID) >= 2
)
GROUP BY userID

If you want to see first-time visitors you could change HAVING COUNT(userID) >= 2 to HAVING COUNT(userID) = 1 or of course instead of EXISTS use NOT EXISTS.

See Demo

Edper
  • 9,144
  • 1
  • 27
  • 46
  • 1
    There is an error in the SQL. Your SQL is equivalent to `SELECT * FROM tbl GROUP BY userID;` . Please see at: http://www.sqlfiddle.com/#!2/26ce3/9 – Menelaos Jul 09 '14 at 14:04
  • 1
    Additionally, (if it worked) this assumes the order of inserts is the actual order of clicks. If MySQL (or the webapp) delays in any way the order can be different than the actual timestamps. Another is, what if two clicks endup having same timestamp due to AJAX delays. Very very unlikely... but worth mentioning the weakness. Test: sqlfiddle.com/#!2/4b475/3 – Menelaos Jul 09 '14 at 14:05
  • @maythesource.com: On your first comment you my 2nd SQL wrong. It should not be `NOT EXISTS (...COUNT(userID)=1)` but `NOT EXISTS (...COUNT(userID)>=2)`. However, on your 2nd comment I think he was looking for the first time visitors rather than first date that the user visits? But you could right on this one. If so that's the case the `MIN(timestamp)` function would be the right one. – Edper Jul 09 '14 at 22:22
  • Your sub query is not dependent on the outer query. It is equivalent to SELECT * FROM tbl GROUP BY userID. Additionally it is `very bad` practice to dependent on default mysql ordering. – Menelaos Jul 13 '14 at 23:55
  • @Bluefire Please review the comments and links. Also: http://stackoverflow.com/questions/24728091/mysql-group-by-behavior-when-using-a-derived-table-with-order-by – Menelaos Jul 14 '14 at 17:20