5

I am trying to find an elegant solution in the form of a SQL query for the following problem.

New records will be inserted into the Log table. I need to detect any new records (inserted in the last hour) whose URL I haven't seen before and generate an alert (e.g. when the number of such records is > 0).

ID, Url, DOB
1, site1.com/page1, "5/06/2012 20:01"
2, site2.com/page2, "5/06/2012 21:20"
3, site1.com/page1, "6/06/2012 10:05"

If "now" is 6/06/2012 10:40, I can see that 1 new record (id=3) was inserted, but I don't want to generate an alert because we have seen this URL before (id=1).

If we also have 4, site3.com/pageX, "6/06/2012 10:08", then I want to generate an alert (return count=1) because this row was inserted in the last hour and we haven't seen its URL before.

What is the best way to implement this? Ideally without nested queries.

GarethD
DmitryK
  • Can be anything really. Ideally an integer that represents a record count. I need this for the monitoring system and am quite flexible in terms of what this query returns. – DmitryK Jun 07 '12 at 12:44
  • How do you intend to implement an "alert" from a SQL statement? A new record in a table named `alerts`, perhaps? – felixgaal Jun 07 '12 at 12:44
  • My monitoring system can run SQL queries and receive results back. I don't want to create additional tables. Has to be a readonly operation without writing into the database – DmitryK Jun 07 '12 at 12:46
  • What if the same URL appears twice in the last hour and not before? And by "in the last hour" do you mean starting from hh:00, or the current time minus 60 minutes? – guido Jun 07 '12 at 12:48
  • What DBMS are you using? – GarethD Jun 07 '12 at 12:53
  • @guido - if it appears twice ideally I would like to see "2" but "1" (as a non-zero flag) would be OK too. – DmitryK Jun 07 '12 at 12:59

5 Answers

5

I think this is what you are after. This will retrieve new entries in the last hour (where "new" means the same URL had not been visited until the last hour):

SELECT  *
FROM    Log
WHERE   DOB > DATEADD(HOUR, -1, CURRENT_TIMESTAMP)
AND     NOT EXISTS
        (   SELECT  1
            FROM    Log T1
            WHERE   T1.URL = Log.URL 
            AND     T1.DOB < DATEADD(HOUR, -1, CURRENT_TIMESTAMP)
        )

Working example on SQL Fiddle

EDIT

Just seen a comment that you only need a count:

SELECT  COUNT(*)
FROM    Log
WHERE   DOB > DATEADD(HOUR, -1, CURRENT_TIMESTAMP)
AND     NOT EXISTS
        (   SELECT  1
            FROM    Log T1
            WHERE   T1.URL = Log.URL 
            AND     T1.DOB < DATEADD(HOUR, -1, CURRENT_TIMESTAMP)
        )
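To check the count query against the sample rows from the question without a SQL Server instance, here is a minimal sketch using Python's built-in `sqlite3`. It assumes SQLite as a stand-in, so the cutoff is passed as a `:cutoff` parameter instead of `DATEADD(HOUR, -1, CURRENT_TIMESTAMP)`, and the dates are rewritten as ISO strings so they compare correctly as text. The helper name `count_new` is made up for the illustration.

```python
import sqlite3

# Same shape as the NOT EXISTS count query above; :cutoff replaces
# DATEADD(HOUR, -1, CURRENT_TIMESTAMP) because SQLite is only a stand-in here.
COUNT_NEW = """
SELECT COUNT(*)
FROM   Log
WHERE  DOB > :cutoff
AND    NOT EXISTS
       ( SELECT 1
         FROM   Log T1
         WHERE  T1.URL = Log.URL
         AND    T1.DOB < :cutoff
       )
"""

# Sample rows from the question, with dates as ISO strings (5 and 6 June 2012).
ROWS = [
    (1, "site1.com/page1", "2012-06-05 20:01"),
    (2, "site2.com/page2", "2012-06-05 21:20"),
    (3, "site1.com/page1", "2012-06-06 10:05"),
]


def count_new(rows, cutoff):
    """Load rows into an in-memory table and run the count query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Log (ID INTEGER PRIMARY KEY, URL TEXT, DOB TEXT)")
    conn.executemany("INSERT INTO Log VALUES (?, ?, ?)", rows)
    (n,) = conn.execute(COUNT_NEW, {"cutoff": cutoff}).fetchone()
    conn.close()
    return n


cutoff = "2012-06-06 09:40"  # "now" (10:40) minus one hour
print(count_new(ROWS, cutoff))  # 0: id=3 is recent, but its URL was seen as id=1
print(count_new(ROWS + [(4, "site3.com/pageX", "2012-06-06 10:08")], cutoff))  # 1
```

The two printed values match the two scenarios described in the question: no alert for id=3, one alert once id=4 is present.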

EDIT 2

I am not sure why it has to be a single select; however, the closest I can get to one is this:

SELECT  COUNT(*)
FROM    (   SELECT  *, MIN(DOB) OVER(PARTITION BY URL) [FirstViewed]
            FROM    Log
        ) Log
WHERE   FirstViewed >= DATEADD(HOUR, -1, CURRENT_TIMESTAMP)

This will still return 2 if the same page has been visited twice in the last hour.

http://sqlfiddle.com/#!3/5a8bc/1
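For anyone unsure what `MIN(DOB) OVER(PARTITION BY URL)` contributes, the same logic can be sketched in plain Python: work out when each URL was first viewed, then count the rows whose URL was first viewed within the last hour. This is only an illustration of the idea, not a replacement for the query; the function name `count_new_rows` and the tuple layout are assumptions.

```python
from datetime import datetime, timedelta


def count_new_rows(rows, now):
    """rows: list of (id, url, dob) tuples with dob as a datetime."""
    cutoff = now - timedelta(hours=1)

    # Equivalent of MIN(DOB) OVER (PARTITION BY URL): earliest visit per URL.
    first_viewed = {}
    for _id, url, dob in rows:
        if url not in first_viewed or dob < first_viewed[url]:
            first_viewed[url] = dob

    # Equivalent of WHERE FirstViewed >= cutoff, counted per row.
    return sum(1 for _id, url, _dob in rows if first_viewed[url] >= cutoff)


rows = [
    (1, "site1.com/page1", datetime(2012, 6, 5, 20, 1)),
    (2, "site2.com/page2", datetime(2012, 6, 5, 21, 20)),
    (3, "site1.com/page1", datetime(2012, 6, 6, 10, 5)),
    (4, "site3.com/pageX", datetime(2012, 6, 6, 10, 8)),
    (5, "site3.com/pageX", datetime(2012, 6, 6, 10, 30)),
]
print(count_new_rows(rows, datetime(2012, 6, 6, 10, 40)))  # 2
```

Rows 4 and 5 share a URL that first appeared in the last hour, so both are counted, which is the "returns 2" behaviour mentioned above.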

GarethD
  • Yes, this one works. Thanks Gareth. Will see if anyone else can come up with a solution without nested queries (i.e. as part of a single SELECT). If not - will accept your answer – DmitryK Jun 07 '12 at 13:26
  • "single select" is just searching for elegance. Not compulsory really. ;) – DmitryK Jun 07 '12 at 13:55
2

This one takes a different approach: it first finds candidate URLs by grouping, then keeps only those seen in the last hour.

SELECT x1.*
FROM
  (SELECT URL,
          COUNT(ID) AS urlcount,
          MAX(DOB) AS uniqueurl
   FROM Log
   GROUP BY URL HAVING count(ID) = 1
   OR MIN(DOB) > dateadd(HOUR ,-1 , CURRENT_TIMESTAMP)) AS x1
WHERE x1.uniqueurl > dateadd(HOUR ,-1 , CURRENT_TIMESTAMP);

http://sqlfiddle.com/#!3/250e0/45/0

I cannot tell whether this performs acceptably without looking at the execution plan, but I think the sort operation involved in the GROUP BY could be a bottleneck.
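As far as I can tell, the two conditions in the grouped query reduce to a single test: a URL qualifies exactly when its earliest row is newer than the cutoff. A minimal sketch of that reduced form, run through Python's `sqlite3` as a stand-in for SQL Server (so `:cutoff` replaces the `dateadd` expression, and the `firstseen` alias and `new_urls` helper are made up for the illustration):

```python
import sqlite3

# Reduced form of the grouped query: one row per URL that first appeared
# after the cutoff, with the number of visits and the first visit time.
NEW_URLS = """
SELECT   URL, COUNT(ID) AS urlcount, MIN(DOB) AS firstseen
FROM     Log
GROUP BY URL
HAVING   MIN(DOB) > :cutoff
ORDER BY URL
"""

ROWS = [
    (1, "site1.com/page1", "2012-06-05 20:01"),
    (2, "site2.com/page2", "2012-06-05 21:20"),
    (3, "site1.com/page1", "2012-06-06 10:05"),
    (4, "site3.com/pageX", "2012-06-06 10:08"),
    (5, "site3.com/pageX", "2012-06-06 10:30"),
]


def new_urls(rows, cutoff):
    """Return (url, visit count, first seen) for URLs new since the cutoff."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Log (ID INTEGER PRIMARY KEY, URL TEXT, DOB TEXT)")
    conn.executemany("INSERT INTO Log VALUES (?, ?, ?)", rows)
    result = conn.execute(NEW_URLS, {"cutoff": cutoff}).fetchall()
    conn.close()
    return result


print(new_urls(ROWS, "2012-06-06 09:40"))
# [('site3.com/pageX', 2, '2012-06-06 10:08')]
```

Summing `urlcount` over the result gives the per-row count the OP asked for, while the number of result rows gives the count of distinct new URLs.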

guido
  • I am not sure this works exactly as required. If the same page was visited twice in the last hour the OP wants this to show as 2 alerts, however your `HAVING COUNT(ID) == 0` means this would show no alerts [Example Here](http://sqlfiddle.com/#!3/6d0f8/2) – GarethD Jun 07 '12 at 13:48
  • i was fixing it while you were writing the comment :) – guido Jun 07 '12 at 13:49
  • It is also worth pointing out that it is not "MsSQL Syntax not accepting columns without aggregate data". This implies this is a quirk of MSSQL, whereas this is SQL Standard and applies to all DBMS I know of except for MySQL, which (abhorrently in my opinion) allows columns in the select list that are not contained in an aggregate or group by. This, to me, is an open invitation for data errors if columns are inadvertently left out of the group by clause. – GarethD Jun 07 '12 at 13:52
1

Without nested query (SQLFiddle):

SELECT COUNT(DISTINCT T0.URL) 
FROM Log AS T0 
LEFT OUTER JOIN Log AS T1 ON 
    T1.URL = T0.URL 
    AND T1.DOB < DATEADD(HOUR, -1, CURRENT_TIMESTAMP) 
WHERE 
    T0.DOB > DATEADD(HOUR, -1, CURRENT_TIMESTAMP) 
    AND T1.ID IS NULL

But it really is the same solution as GarethD's, performance-wise.
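One difference worth noting is that `COUNT(DISTINCT T0.URL)` counts new URLs rather than new rows. A small sketch with Python's `sqlite3` (a stand-in for SQL Server, with `:cutoff` replacing the `DATEADD` expression and a hypothetical `anti_join_counts` helper) shows both counts side by side when a new URL is visited twice in the last hour:

```python
import sqlite3

# The LEFT JOIN / IS NULL query above, returning both the distinct-URL count
# and the plain row count so the two can be compared.
ANTI_JOIN = """
SELECT COUNT(DISTINCT T0.URL), COUNT(*)
FROM Log AS T0
LEFT OUTER JOIN Log AS T1 ON
    T1.URL = T0.URL
    AND T1.DOB < :cutoff
WHERE
    T0.DOB > :cutoff
    AND T1.ID IS NULL
"""

ROWS = [
    (1, "site1.com/page1", "2012-06-05 20:01"),
    (2, "site2.com/page2", "2012-06-05 21:20"),
    (3, "site1.com/page1", "2012-06-06 10:05"),
    (4, "site3.com/pageX", "2012-06-06 10:08"),
    (5, "site3.com/pageX", "2012-06-06 10:30"),
]


def anti_join_counts(rows, cutoff):
    """Return (distinct new URLs, new rows) for the given cutoff."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Log (ID INTEGER PRIMARY KEY, URL TEXT, DOB TEXT)")
    conn.executemany("INSERT INTO Log VALUES (?, ?, ?)", rows)
    result = conn.execute(ANTI_JOIN, {"cutoff": cutoff}).fetchone()
    conn.close()
    return result


print(anti_join_counts(ROWS, "2012-06-06 09:40"))  # (1, 2)
```

Either value works as a non-zero alert flag; which one to use depends on whether the monitoring system should count URLs or rows.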

  • `LEFT JOIN/IS NULL` performs worse in SQL-Server than `NOT EXISTS` http://stackoverflow.com/questions/2246772/whats-the-difference-between-not-exists-vs-not-in-vs-left-join-where-is-null – GarethD Jun 07 '12 at 13:55
0

Try this:

SELECT DISTINCT a.id, a.url, a.dob
FROM Log a JOIN Log b ON (a.url = b.url)
WHERE UNIX_TIMESTAMP(NOW())-UNIX_TIMESTAMP(a.DOB)<=3600 
  AND UNIX_TIMESTAMP(NOW())-UNIX_TIMESTAMP(b.DOB)>3600;

It should return all the records that follow the pattern you specified in the question.

Note that I use UNIX_TIMESTAMP to convert the dates to seconds, so the subtraction returns the time difference in seconds, which is then compared against 3600 seconds (one hour).
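The arithmetic behind the 3600-second comparison can be checked with the timestamps from the question. This is a small Python sketch of the same subtraction, not MySQL code; `seconds_between` is a made-up helper, and plain `datetime` values are used so that no time zone conversion is involved.

```python
from datetime import datetime


def seconds_between(now, dob):
    """Difference between two datetimes in whole seconds."""
    return int((now - dob).total_seconds())


now = datetime(2012, 6, 6, 10, 40)
recent = seconds_between(now, datetime(2012, 6, 6, 10, 5))  # row id=3
older = seconds_between(now, datetime(2012, 6, 5, 20, 1))   # row id=1

print(recent, recent <= 3600)  # 2100 True  -> inside the last hour
print(older, older > 3600)     # 52740 True -> older than one hour
```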

EDIT:

The statement has been corrected, but it is for MySQL (I didn't see the sql-server2005 tag).

felixgaal
  • This does the opposite of what is required. It will only return rows where the same URL has been visited outside of the last hour. These rows should be excluded and only rows returned where the URL has been visited for the first time in the last hour. This is also not SQL-Server Syntax. – GarethD Jun 07 '12 at 13:19
  • Sorry, I didn't see the sql-server2005 tag! – felixgaal Jun 07 '12 at 17:22
-1

select distinct(a.url) from tbl a, tbl b where a.dob>(now-hour) and b.dob<=(now-hour) and a.url=b.url; 

(Replace the time arithmetic with whatever your database of choice provides, and index the url and dob columns.)

Also, hope that your database is sensible enough to apply the dob comparison before the join and to join using the indexes.

Markus Mikkolainen