1

I am currently trying to write a query from a table that list out users surfing behaviour. The table looks like the one below

**RecordID  RespondentID    DeviceID    UTCTimestamp    Domain**
1   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:21    goodreads.com 
2   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:21    goodreads.com 
3   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:21    gr-assets.com 
4   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:21    gr-assets.com 
5   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:23    itunes.apple.com 
6   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:23    itunes.apple.com 
7   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:51    samplicio.us 
8   01faca75-1216-4a55-b43c-9d64ade852f7    4DF57C06-F0BD-4779-8983-37A8B02E5EDF    06/11/2017 10:51    samplicio.us

Thanks to everyone's help I managed to get this.

RecordID RespondentID UTCTimestamp Source Domain To Domain RecordID

2   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:21    goodreads.com   gr-assets.com   3
4   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:21    gr-assets.com   itunes.apple.com    5
6   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:23    itunes.apple.com    samplicio.us    7

The To Domain is the value from the next row where the domain name is different.

Problem While this looks correct, we actually have skipped the entire first record. This is because, given the data set, the first row "domain" is joined to the second row "domain" and we skipped it. Row 2 is combined with Row 3 hence the first resulting record shows RecordID 2. I would like to fine tune this further. My result should should start from RecordID 1 and skipping RecordID 2 since the domain is the same, hence the result should show

RecordID RespondentID UTCTimestamp Source Domain To Domain

1   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:21    goodreads.com   gr-assets.com   
3   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:21    gr-assets.com   itunes.apple.com    
5   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:23    itunes.apple.com    samplicio.us    

I tried skipping the RecordID 2, however, an SQL error 'prev_nane' does not exist was encountered.

SELECT t1."RecordID", t1."RespondentID", t1."UTCTimestamp", t1."Domain" as "Source Domain", t2."Domain" as "To Domain" , t2."RecordID", lag(t1."Domain",1) over (order by t1."RecordID") as prev_name
from public."Traffic - Mobile" as t1
  join public."Traffic - Mobile" as t2 on t2."RespondentID" = t1."RespondentID" AND t2."DeviceID"=t1."DeviceID" AND t2."RecordID"=t1."RecordID"+1  And t1."Domain"<>T2."Domain" AND t2."UTCTimestamp">=t1."UTCTimestamp" AND t2."Sequence"-t1."Sequence"=1 and t1."RecordID"<13 AND t1."Domain"<>prev_name;

What did I do wrong?

Also my final result I want to achieve is below RecordID RespondentID UTCTimestamp Source Domain To Domain Final Destination

1   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:21    goodreads.com   gr-assets.com   samplicio.us
3   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:21    gr-assets.com   itunes.apple.com    samplicio.us
5   01faca75-1216-4a55-b43c-9d64ade852f7    06/11/2017 10:23    itunes.apple.com    samplicio.us    samplicio.us

An additional column called 'Final destination'. This is to allow me to group the 3 transactions together as the path taken to reach samplicio.us.

Thanks in advance.

  • If timestamp on recordid 2 and record 3 are 10:22 and time on record 4 is 10:23, would the expected result changes? I am trying to figure out when to "skip" a record? – kc2018 Jan 05 '18 at 04:50
  • Take a look at https://stackoverflow.com/questions/18290572/postgres-next-previous-row-sql-query – kc2018 Jan 05 '18 at 05:03
  • Yes, I notice that too. So my skip will be this. If record 1 shows domain A, record 2 show domain A, record 3 shows domain B, the result will show From Source, Domain A , to Source Domain B. What I am trying to do is capture the moment someone enter a site and then move on to another site. What they do within a site is not important at this moment. –  Jan 05 '18 at 06:17
  • Is a user identified by RespondentID and DeviceID? Is the ID system generated unique identifier and can be used to tell which is the earlier record for each user? – kc2018 Jan 05 '18 at 06:41
  • Yes a user is identified by RespondentID and DeviceID. The timestamp is generated by the system that capture this movement. Some movement are longer ie, we can see a few seconds or minutes but they are now sorted based on sequence. The same respondent can have a different device and their movement on those devices are treated as if they are different users. –  Jan 05 '18 at 07:20
  • Where is the sequence number? – kc2018 Jan 05 '18 at 07:54
  • We need to derive the Sequence No based on the time stamp and order. The data as we get it is assumed to be sorted by time, group by respondentID. I have a recordID which serialise data. For all internt, the RecordID list the events chronologically. –  Jan 05 '18 at 08:46
  • @Kc2018, would this be simpler using a loop function. If I was to write this in VB.net or C#, I would compare the strings and just go to next if the condition is not met. I understand that PLSQL can do that. –  Jan 05 '18 at 10:17
  • Wrote a query http://sqlfiddle.com/#!17/dc248/3 – kc2018 Jan 05 '18 at 18:57
  • Take a look at the output the sqlfiddle. If that works, I will explain the code. Basically it uses windows function LEAD to get the next domain and LAST_VALUE to generate the final destination. Should a user session be included in the result if there is only one activity (one domain) - see record 11 in the test data. As it stands now, it is not in the result. – kc2018 Jan 05 '18 at 19:07
  • Yes, that's exactly what I want. –  Jan 06 '18 at 05:34
  • Basically am I correct T1 was created as temporary table and another round of processing was done on T1? –  Jan 06 '18 at 05:43
  • Yes, t1 is a Common Table Expression https://www.postgresql.org/docs/9.1/static/queries-with.html – kc2018 Jan 06 '18 at 05:47
  • Just a question, what will happen if a respondent have data over many days? For now we are making an assumption, that 1 complete transaction is done over 10-20 minutes. The same user can come back again a few hours later, and we consider that seperately. –  Jan 06 '18 at 05:52
  • As it stands now, the movements over many days from the same device and respondent would be grouped as one set of movement for that respondent. If the same user come back after a few hours is considered as a NEW set of movement, you will need to identify the criteria (the time frame: how many hours or minutes) and modified the query. – kc2018 Jan 06 '18 at 06:11
  • Can you explain this part of the code? partition by respondentid, deviceid order by user_seq rows between unbounded preceding and unbounded following ) as final_domain –  Jan 06 '18 at 07:34
  • It defines how to group/partition the records and the grouping/partitioning is by respondentid and deviceid. Within each partition, the records are to be sort by user_seq. Unbounded preceding is the very first row within the partition whereas unbounded following is the very last row within that partition. https://stackoverflow.com/questions/42299101/window-function-last-valueorder-by-asc-same-as-last-valueorder-by-de – kc2018 Jan 06 '18 at 09:14
  • Thanks for the detailed explaination, If I want to group the transactions to ensure that each sequence is within 60 minutes (i,e from the first time to the last row is 60 minutes, what do I need to do that? –  Jan 06 '18 at 11:27
  • I would suggest you close this question and raise a new question for the within 60 requirement so that more people can help you. Sounds to me you want a new partition when the time gap between two records is greater than 60 mins. – kc2018 Jan 06 '18 at 12:41
  • Thanks KC. I will close this question. It is a great starting point. –  Jan 06 '18 at 15:42

1 Answers1

0

Try this:

with t1 as
(
select 
recordid,
respondentid,
deviceid,
utctimestamp,
domain,
row_number() over (partition by 
                   respondentid,
                   deviceid 
                   order by 
                   utctimestamp,
                   recordid) as user_seq,
row_number() over (partition by 
                   respondentid,
                   deviceid,
                   domain
                   order by 
                   utctimestamp,
                   recordid) as user_domain_seq
from traffic_mobile)

select *
from
(
select 
recordid,
respondentid,
deviceid,
utctimestamp,
domain,
lead(domain) over ( partition by 
                    respondentid,
                    deviceid order by 
                   user_seq) as next_domain,
last_value(domain) over( partition by
                      respondentid,
                      deviceid order by user_seq 
                      rows between unbounded preceding
                      and unbounded following ) 
                      as final_domain
from t1
where 
user_domain_seq = 1 ) t2
where t2.next_domain is not null

sqlfiddle: sqlfiddle.com/#!17/dc248/3

PS. The query will not return a line for users with only 1 entry on traffic_mobile table. The query would need to be improved to include them if it is a requirement.

kc2018
  • 1,440
  • 1
  • 8
  • 8