1

Have a table in sql server 2008 in which page visits are captured, something similiar to IIS logs or Google Analytics, nothing too fancy.

The table has columns such as:

[ID], [Username], [URL], [Created], [Browser], [BrowserVersion], [HostName], [IPAddress], [OperatingSystem], [UrlReferrer]

Below picture illustrates a query output sorted by creation date(Created column) in descending format representing/listing all page hits. URLs and Usernames are omitted for privacy purposes.

Query

What I would like to do is to write a query to group all rows in one where time difference from last record occurance is less than 30 minutes for an identical IPAddress and Username or in other words select/return only the last record for the same IPAddress and Username and eliminate all other rows prior to that.

Desired outcome is only those rows with the an arrow next to them(image below):

enter image description here

Amir Kayvan
  • 36
  • 1
  • 5
  • 2
    What is your desired output? – sgeddes May 02 '13 at 23:50
  • @sgeddes Imagine a 'blob' of rows that are defined by no row in the blob being more than 30 minutes from every single other row. He wants all of those 'blobs' per IPAddress and Username to be grouped rows. – Patashu May 02 '13 at 23:54
  • The problem is that you'll need to bracket time into 30-minute chunks and group them by IP and TimeChunk. You can't just say "within 30 minutes of each other" because if records are spread over more than 30 minutes it won't know how to group them. – Russell Fox May 03 '13 at 00:04
  • See this: http://stackoverflow.com/questions/5002661/how-to-group-time-by-hour-or-by-10-minutes – Russell Fox May 03 '13 at 00:05
  • 1
    @Russell Fox The linked question is a different question. He wants to bucket/histogram rows by what 30 minute interval they fall in. Amir Kayvan's question is to 'blob' all rows together as long as they're closer than 30 minutes together - e.g. session detection. – Patashu May 03 '13 at 00:16
  • That's correct @Patashu, the end result I'm looking for is what you described here. – Amir Kayvan May 03 '13 at 03:12

3 Answers3

1

First, your requirement can be interpreted more than one way so let me stat what I think you are wanting... I think you are saying a session ends when a given IP address has 30 minutes of inactivity. So if an IP address hits a site every minute for 2 hours and then takes a 30 minute break that represents one session. Assuming that is what you intended...

You can use LEAD and LAG to help you identify sessions. My test data consisted of an Id column, an IPAddress column, and a Created column. Here is the code, explanation follows...

WITH t1 AS 
(
    SELECT 
        * 
        , DATEDIFF(minute, LAG(Created, 1, 0) OVER (PARTITION BY IPAddress ORDER BY Created), [Created]) AS SinceLastAccess
    FROM 
        IISLog
), sessionStarts AS 
(
    SELECT 
        * 
    FROM 
        t1
    WHERE 
        SinceLastAccess >= 30
), sessionInfo AS 
(
    SELECT 
        IPAddress
        , Created AS SessionStart
        , LEAD(Created, 1, '2025-01-01') OVER (PARTITION BY IPAddress ORDER BY CREATED) AS SessionEnd
    FROM 
        sessionStarts
)

SELECT * FROM sessionInfo

The first CTE (t1) selects the data, but adds a column called SinceLastAccess. This new column uses the LAG function to look at the value in the previous row and calculate how many minutes have passed. The PARTITION BY constrains this calculation to each IP address.

The second CTE (sessionStarts) simply selects those rows from t1 where the SinceLastAccess value is greater than 30. This effectively tells us the beginning of every session.

Finally, the `sessionInfo' CTE builds upon the second. Using the LEAD function we look forward to see where the next session begins. This value is taken to be when the current row's session ends. What we finally end up with is an IP address, session start, and session end. Now that you have these, it should be easy to JOIN this to the original table and GROUP it up.

dazedandconfused
  • 3,131
  • 1
  • 18
  • 29
  • Bah! Forgot my test machine is on 2012. Check out this link on stimulating lead/lag in 2008, maybe it will help. http://www.rafael-salas.com/2008/05/t-sql-lead-and-lag-functions.html?m=1 – dazedandconfused May 03 '13 at 10:11
0

This should get you partway there. It looks for all entries that don't have another within 30 mins after (or vice versa, looking for entries that don't have another 30 mins before it)

This should give you the date/time of the end of each "blob" of entries.

SELECT m1.IpAddress, m1.DateCreated
FROM MyLog m1 LEFT OUTER JOIN MyLog m2 
                 ON m1.IpAddress = m2.IpAddress AND 
                    DATEDIFF(minute, m1.DateCreated, m2.DateCreated) BETWEEN 0 AND 30
WHERE m2.DateCreated IS NULL

This should give you the date/time of the start of each "blob".

SELECT m1.IpAddress, m1.DateCreated
FROM MyLog m1 LEFT OUTER JOIN MyLog m2 
                 ON m1.IpAddress = m2.IpAddress AND 
                    DATEDIFF(minute, m2.DateCreated, m1.DateCreated) BETWEEN 0 AND 30
WHERE m2.DateCreated IS NULL
Dylan Smith
  • 22,069
  • 2
  • 47
  • 62
  • thanks for your answer; However, this is a long running query which does not fulfill what I'm looking for. End result I'm looking for is close to what Patashu described. – Amir Kayvan May 03 '13 at 03:11
  • not sure I understand, Patashu said you *don't* want a histogram, and want to "blob" them together which is what I did. – Dylan Smith May 03 '13 at 07:09
0

The great thing about Microsoft SQL is that they have the non-standard 'select top n ...' possibility; and that is what you can use here:

select * from

(
select
id,
ipAdress,
created,
(
select
top 1
created 
from tbl as tPrevious
where tPrevious.ipAdress=t.ipAdress
and tPrevious.created<t.created
order by created desc
) as previousCreated
from tbl as t
) as joined

where
previousCreated is not null 
and DATEDIFF(min, created,previousCreated) between 0 and 30
Chris Allen
  • 605
  • 3
  • 4