1

I have a site where I record client metrics in a SQL Server 2008 db on every link clicked. I have already written the query to get the daily total clicks, however I want to find out how many times the user clicked within a given timespan (ie. within 5 seconds).

The idea here is to lock out incoming IP addresses that are trying to scrape content. It would be assumed that if more than 5 "clicks" is detected within 5 seconds or the number of daily clicks from a given IP address exceeds some value, that this is a scraping attempt.

I have tried a few variations of the following:

-- when a user clicked more than 5 times in 5 seconds
SELECT DATEADD(SECOND, DATEDIFF(SECOND, 0, ClickTimeStamp), 0) as ClickTimeStamp, COUNT(UserClickID) as [Count]
FROM UserClicks
WHERE DATEDIFF(SECOND, 0, ClickTimeStamp) = 5
GROUP BY IPAddress, ClickTimeStamp

This one in particular returns the following error:

Msg 535, Level 16, State 0, Line 3 The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

So once again, I want to use the seconds datepart, which I believe I'm on the right track, but not quite getting it.

Help appreciated. Thanks.

-- UPDATE --

Great suggestions and helped me think that the approach is wrong. The check is going to be made on every click. What I should do is for a given timestamp, check to see if in the last 5 seconds 5 clicks have been recorded from the same IP address. So it would be something like, count the number of clicks for > GetDate() - 5 seconds

Trying the following still isn't giving me an accurate figure.

SELECT COUNT(*)
FROM UserClicks
WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate())
ElHaix
  • 12,846
  • 27
  • 115
  • 203
  • The issue with the datediff is that it is overflowing (apparently). You have your startdate as 0, and the `ClickTimeStamp` as whatever datetime is recorded. Anything over approximately 68 years time gap with the datepart as seconds will have an overflow, since datediff returns an int. See more info on it at: http://stackoverflow.com/questions/1275208/sql-server-datediff-function-resulted-in-an-overflow – king14nyr Dec 05 '11 at 17:34
  • There's a logic problem with the request as well. Say you have an IP that clicks at 10:00:00 AM and then 3 times more between 10:00:01 and 10:00:04. Then again at 10:00:07. Would your output contain a count of (1+3) ignoring the 10:00:07 or would it be 1+3+1 since 10:00:07 is within 5 seconds of 10:00:04? Depending on response would determine the method to get the results. – xQbert Dec 05 '11 at 18:03

3 Answers3

1

Hoping my syntax is good, I only have oracle to test this on. I'm going to assume you have an ID column called user_id that is unique to that user (is it user_click_id? helpful to include table create statements in these questions when you can)

You'll have to preform a self join on this one. Logic will be take the userclick and join onto userclick on userId = userId and difference on clicktimestamp is between 0-5 seconds. Then it's counting from the subselect.

select u1.user_id, u1.clicktimestamp, u2.clicktimestamp
from userclicks uc1
left join user_clicks uc2  
    on u2.userk_id = u1.user_id
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) <= 5
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) > 0

This select statement should give you the user_id/clicktimestampe and 1 row for every record that is between 0 and 5 seconds apart from that clicktimestamp from the same user. Now it's just a matter of counting all user_id,u1.clicktimestamp combinations and highlighting the ones with 5 or more. Take the above query and turn it into a subselect and pull counts from it:

select u1.user_id, u1.clicktimestamp, count(1)
from 
(select u1.user_id, u1.clicktimestamp
from userclicks uc1
left join user_clicks uc2  
    on u2.userk_id = u1.user_id
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) <= 5
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) > 0) a
group by u1.user_id, u1.clicktimestamp
having count(1) >= 5

Wish I could verify my syntax on a MS machine....there might be some typo's in there, but the logic should be good.

Twelfth
  • 7,070
  • 3
  • 26
  • 34
  • I took a look at this approach, for getting all the times when clicks are < 5 in a 5 second span, however I get a bunch of "... could not be bound" errors. – ElHaix Dec 05 '11 at 20:26
0

Assuming log entries are only entered for current activity -- that is, whenever a new row is inserted, the logged time is for that point in time and never for any prior point in time -- then you should only need to review data for a set period of time, and not have to review "all data" as you are doing now.

Next question is: how frequently do you make this check? If you are concerned with clicks per second, then something between "once per hour" and "once every 24 hours" seems reasonable.

Next up: define your interval. "All clicks per IPAddress within 5 seconds" could go two ways: set window (00-04, 05-09, 10-14, etc), or sliding window(00-04, 01-05, 02-06, etc.) Probably irrelevant with a 5 second window, but perhaps more relevant for longer periods (clicks per "day").

With that, the general approach I'd take is:

  • Start with earliest point in time you care about (1 hour ago, 24 hours ago)
  • Set up "buckets", means by which time windows can be identified (00:00:00 - 00:00:04, 00:00:05 - 00:00:09, etc.). This could be done as a temp table.
  • For all events, calculate number of elapsed seconds since your earliest point
  • For each bucket, count number of events that hit that bucket, grouped by IPAddress (inner join on the temp table on seconds between lowValue and highValue)
  • Identify those that exceed your threshold (having count(*) > X), and defenestrate them.
Philip Kelley
  • 39,426
  • 11
  • 57
  • 92
  • Does this mean the user has to do 5 clicks in the time frame defined? what if the user clicked on 00:00:02 , 03, 04, 05,06? THat wouldn't fit into one bucket. I think he needs a self join for this – Twelfth Dec 05 '11 at 19:13
  • That's where the "sliding window" idea comes in. It's trickier, in that you have to take every second in which something happened, and check from it through the next 5 seconds. (Possible but complex, which is why I didn't spell it out--buckets are easier.) – Philip Kelley Dec 05 '11 at 22:36
  • ...and then I read your post, which does what I just yakked about. I'd give it +1, but neither of us sovled his problem... – Philip Kelley Dec 05 '11 at 22:37
  • Ah yes, the range would just have to be incremented by 1 seconds for a 5 second period. Makes sense. – Twelfth Dec 06 '11 at 20:20
0

An answer for your UPDATE: the problem is in the third line of

SELECT COUNT(*)
 FROM UserClicks
 WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate()) 

GetDate() - DATEADD(SECOND, -5, GetDate()) is saying "take the current date time and subtract (the current date time minus five seconds)". I'm not entirely sure what kind of value this produces, but it won't be the one you want.

You still want some kind of time-period, perahps like so:

SELECT count(*)
 from UserClicks
 where IPAddress = @IPAddress
  and ClickTimeStamp between getdate() and dateadd(second, -5, getdate())

I'm a bit uncomfortable using getdate() there--if you have a specific datetime value (accurate to the second), you should probably use it.

Philip Kelley
  • 39,426
  • 11
  • 57
  • 92
  • Hmm, I see. I tried this...clicked on a few links, ran the query (with -5, not 5), gives 0. – ElHaix Dec 06 '11 at 02:17
  • Had to swap the first GETDATE() and DATEADD() - that works. However, with that I can achieve the same with :: WHERE ClickTimeStamp > dateadd(second, -5, getdate()) :: Is there a performance difference in using ">" and BETWEEN? – ElHaix Dec 06 '11 at 16:26
  • BETWEEN would be slower as there's more to check -- though, depending on indexing and other factors, mere humans might not notice the extra few milliseconds. I used BETWEEN because I could not be certain you would end up using getdate(). As for the ordering of the two, that (as ever) is what comes from my not being able to test my SO code. – Philip Kelley Dec 06 '11 at 17:01