3

I have a dataset that looks like the following:

enter image description here

For each OwnerID, I'd like to calculate the difference in column creationtime for the current record and the next record (for the same ownerID), in the form of a new column TimeDiff. I believe a self join would be required here, but I'm not sure how to use the self join to calculate the difference between the current record and the next record.

While doing this, the very last record for any ownerID can have a default value of 'NA' as there won't be a next record (for the same ownerID) to calculate the difference from.

Here's the query that I used to get this dataset:

    SELECT DISTINCT ga.ownerid,
         mr.name,
         SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
         EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
         EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
         EXTRACT(DAY FROM ga.creationtime) AS DAY,
         EXTRACT(DOW FROM ga.creationtime) AS DOW,
         ga.creationtime,
         a.encodedid,
         a.name
  FROM flx2.groupactivities ga
    JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
    JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
    JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
    JOIN flx2.memberroles mr ON mr.id = mhr.roleid
  WHERE ga.activitytype = 'assign'
  AND   ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
  AND   a.artifacttypeid = 54
  AND   a.encodedid IS NOT NULL
  ORDER BY ga.ownerid,
           ga.creationtime,
           a.encodedid

I'm using Amazon Redshift to get this data.

Any help would be appreciated.

TIA!

UPDATE:

I used the method suggested by @systemjack . Here are the results that I get:

enter image description here

We can clearly notice here that the column encodedid is getting repeated for the same assignmentID (MAT.PRB.410, as highlighted in the image above), which shouldn't be the case. This wasn't happening without the LEAD function, in the query mentioned above. Here's the updated query that I am using (only has an extra LEAD function):

SELECT DISTINCT ga.ownerid,
       mr.name,
       SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
       EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
       EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
       EXTRACT(DAY FROM ga.creationtime) AS DAY,
       EXTRACT(DOW FROM ga.creationtime) AS DOW,
       ga.creationtime,
       LEAD(ga.creationtime,1) OVER (PARTITION BY ga.ownerid ORDER BY ga.creationtime) AS nexttime,
       a.encodedid,
       a.name
FROM flx2.groupactivities ga
  JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
  JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
  JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
  JOIN flx2.memberroles mr ON mr.id = mhr.roleid
WHERE ga.activitytype = 'assign'
AND   ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
AND   a.artifacttypeid = 54
AND   a.encodedid IS NOT NULL
ORDER BY ga.ownerid,
         ga.creationtime,
         a.encodedid LIMIT 1000

The values in the nexttime column also seem to be jacked up. It seems to be taking the next value in the creationtime column on ocassion. For example: In the 2nd record, the value of nexttime column should've been 2013-09-18 06:14:59 instead of 2014-01-18 12:16:49

Why are we getting more records than expected? How do I fix these problems?

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Patthebug
  • 4,647
  • 11
  • 50
  • 91
  • can you alter the table and add another column? – DCR Feb 22 '17 at 20:19
  • I'm getting this data using the query I have posted in the question. So it's a resultset, and not a table itself. If you're asking for a column to be added to either of `flx2.groupactivities`, `flx2.memberstudytrackitemstatus`, `flx2.artifacts`, `auth.memberhasroles`, `flx2.memberroles`, then that's not possible. – Patthebug Feb 22 '17 at 20:24
  • In the case of the `encodeid`, you must be getting duplicate rows from one of the joins. The global DISTINCT hides that when you don't have the LEAD in there which makes the two records distinct since one gets its lead from the duplicate and the second gets it from the actual next record. They're out of order in the result because the `order by` doesn't include the nexttime and they have the same other values so it's arbitrary. I'd recommend separating your base query...can stick it in a WITH clause or sub-select and applying the LEAD and any aggregation after so the DISTINCT is effective. – systemjack Feb 23 '17 at 00:15

3 Answers3

2

Update: Does this look better?

with dataset as (
    SELECT DISTINCT ga.ownerid,
        mr.name,
        SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
        EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
        EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
        EXTRACT(DAY FROM ga.creationtime) AS DAY,
        EXTRACT(DOW FROM ga.creationtime) AS DOW,
        ga.creationtime,
        a.encodedid,
        a.name
    FROM flx2.groupactivities ga
    JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
    JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
    JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
    JOIN flx2.memberroles mr ON mr.id = mhr.roleid
    WHERE ga.activitytype = 'assign'
        AND   ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
        AND   a.artifacttypeid = 54
        AND   a.encodedid IS NOT NULL
)
select d.*,
    LEAD(creationtime,1) OVER (PARTITION BY ownerid ORDER BY creationtime) AS nexttime
from dataset d
ORDER BY ownerid, creationtime, encodedid, nextime
LIMIT 1000

Something like this (untested code) might work. Idea is to use the LEAD window function to get the creationtime of the following record for each owner, which will be null if it's the last record, and then use regular DATEDIFF to get the units you want. The CASE statement in the outer query handles the last record edge case and you can tweak that to get the result you want there.

select ownerid, creationtime,
    case when nextime is not null
        then datediff('second', creationtime, nextime)
        else datediff('second', creationtime, sysdate)
        end as timediff
from (
    select distinct ownerid, creationtime,
        lead(creationtime,1) over (partition by ownerid order by creationtime) as nexttime
    from yourdata
)
systemjack
  • 2,815
  • 17
  • 26
  • Thanks so much for your answer, it makes sense in theory. When I tried to implement it, I see some strange behavior. If you notice in the sample dataset that I have posted, the `creationtime` can be the same for an `assignmentID`. The column `nexttime` seems to be getting confused with that. After including the `LEAD` window function, I can see that I'm getting an increased resultset (more # of records) as compared to the original resultset. I will update the original question with the results after implementing the `LEAD` function. – Patthebug Feb 22 '17 at 22:25
  • If the assignmentID represents a group and can be aggregated...i.e. you only care about the difference in creationtime between distinct assignments, you can add distinct keyword into the sub-select. – systemjack Feb 22 '17 at 22:31
  • No, I don't necessarily care about the difference in times between different assignments. I just literally want the difference in time from the next record, for each record, for an `ownerID` (even if they have the same `assignmentID`). – Patthebug Feb 22 '17 at 22:45
  • I've also updated the question to add more details to it. – Patthebug Feb 22 '17 at 22:45
1

I personally see no declarative (pure SQL) way to achieve that. Sorry. You can't refer value in particular records (even whether it next or prev) in the set, and that is by nature.

So there's three ways I can see here:

1) Use Procedural Extension to SQL (MySQL has one too).

2) Get the whole set and process it externally, at the "client" (to RDBMS) side.

3) Add a timediff column to the table + AFTER INSERT/UPDATE trigger where you'd be calculating that difference and append the record with it.

Yuri G
  • 1,206
  • 1
  • 9
  • 13
  • MySQL does not support window functions, but there are hacks like this one for LAG: http://stackoverflow.com/questions/11303532/simulate-lag-function-in-mysql – systemjack Feb 22 '17 at 21:56
  • right, that's exactly what I said - "use procedural extention" :-) – Yuri G Feb 22 '17 at 22:10
0

So I finally figured out a way to achieve this. I made use of Dense_Rank() function and used the following query to get the result:

WITH t AS
(
  SELECT DISTINCT ga.ownerid,
         SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
         EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
         EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
         EXTRACT(DAY FROM ga.creationtime) AS DAY,
         EXTRACT(DOW FROM ga.creationtime) AS DOW,
         ga.creationtime,
         DENSE_RANK() OVER (PARTITION BY ga.ownerid ORDER BY ga.ownerid,ga.creationtime,a.encodedid) AS RowNum,
         a.encodedid,
         a.name
  FROM flx2.groupactivities ga
    JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
    JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
    JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
    JOIN flx2.memberroles mr ON mr.id = mhr.roleid
  WHERE ga.activitytype = 'assign'
  AND   ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
  AND   a.artifacttypeid = 54
  AND   a.encodedid IS NOT NULL
  ORDER BY ga.ownerid,
           ga.creationtime,
           RowNum,
           a.encodedid
)
SELECT top 100 t1.ownerid,
       t1.assignmentid,
       t1.year,
       t1.month,
       t1.day,
       t1.dow,
       t1.creationtime,
       t2.creationtime,
       datediff(day,t1.creationtime,t2.creationtime),
       t1.encodedid,
       t1.name
FROM t AS t1
  LEFT JOIN t AS t2
         ON t1.ownerid = t2.ownerid
        AND t1.rownum + 1 = t2.rownum
ORDER BY t1.ownerid,
         t1.creationtime,
         t1.rownum,
         t1.encodedid

This gave me the following:

enter image description here

Patthebug
  • 4,647
  • 11
  • 50
  • 91