When Joining a SQL Table to Itself, Return only One Result. Time Log Entry

Question

I am working with some data from Jira. In Jira there are "Issues" and each issue can go through states such as New, In Progress, Review, etc. I want to measure the time something stays in each state. It is possible for things to move back and forth and return to a state multiple times. Jira produces a log table which logs the event of the issue moving from one state to another. To measure how long it’s been in a state you need to have the entry from the log for that state (end) and the previous change of state (start).

I can’t return one entry for each state change. For Issues that moved between a state multiple times I get multiple rows.

I tried the gaps and islands approach. Also a select within the top select. Min or Max in the join was atrociously slow.

The desired result would be a column added to the table in the select which gives the duration for the State in the column ItemFromString. The date difference is between this entry’s Created date and the previous state change entry’s created date, which shows when the issue moved to this state. In the example data below the first entry for Assessment, History ID 436260, would be a duration of 9/19–9/14. When I join I get multiple entries for this History ID since there are multiple Assessments. I filter the join by Issue key and Item from/to String where they match; however, I need to also add a filter where it looks at any entries created before the current items created date and selects the most recent, or largest, one. This is where I am hung up.

Fields:

Created - This is when the log entry was created which is the date time it changed state from ItemFromString to ItemToString.
IssueCreated - This is when the issue the log is about, was created. For example, they start in the new state so we need this date to figure out how long it sat in new as the first log entry will be it moving from New to something else.
IssueKey and IssueID are almost the same thing, they are key ID's for the issue in a different table.
HistoryID is the key for each log entry in this table.

Assessment

IssueKey	HistoryID	IssueId	Created	IssueCreatedDate	ItemFromString	ItemToString
TPP-16	434905	208965	9/14/2022 14:33	9/14/2022 8:56	New	Assessment
TPP-16	436260	208965	9/19/2022 8:32	9/14/2022 8:56	Assessment	Internal Review
TPP-16	437795	208965	9/19/2022 16:11	9/14/2022 8:56	Internal Review	New
TPP-16	437796	208965	9/19/2022 16:11	9/14/2022 8:56	New	Assessment
TPP-16	439006	208965	9/20/2022 15:08	9/14/2022 8:56	Assessment	New
TPP-16	457786	208965	10/17/2022 11:02	9/14/2022 8:56	New	Assessment
TPP-16	457789	208965	10/17/2022 11:03	9/14/2022 8:56	Assessment	Internal Review
TPP-16	490205	208965	10/27/2022 15:15	9/14/2022 8:56	Internal Review	On Hold
TPP-16	539391	208965	1/11/2023 15:24	9/14/2022 8:56	On Hold	Backlog

This query does not get a duration as the last column in the query. The query creates a table that is then published and utilized by BI products for graphing and analysis.

SELECT 
    IssueChangelogs.IssueKey IssueKey, 
    IssueChangelogs.HistoryId HistoryId, 
    IssueChangelogs.IssueId IssueId, 
    IssueChangelogs.IssueCreatedDate IssueCreatedDate, 
    IssueChangelogs.ItemFromString ItemFromString, 
    IssueChangelogs.ItemToString ItemToString,
    ICLPrev.Created PrevCreated, --For Testing
    IssueChangelogs.Created Created, 
    ICLPrev.HistoryID PrevHistoryID, --For Testing
    CASE 
        -- If the join found a match for a previous status, then we can calculate the Duration it was in that state.
        WHEN ICLPrev.HistoryID IS NOT NULL 
            THEN DATEDIFF(hour, ICLPrev.Created,  IssueChangeLogs.Created)/24
        -- If the state was new then we need to use the IssueCreatedDate as the start date as the default state is New for each issue.
        WHEN  IssueChangeLogs.ItemFromString LIKE '%New%'
            THEN Round(DATEDIFF(hour, IssueChangeLogs.IssueCreatedDate, IssueChangeLogs.Created), 2)/24
        -- Else, let's add something easy to identify so when we test and look at the table we know what occured.
        ELSE 0.01
        END AS Duration
FROM      
    TableNameRedacted AS IssueChangelogs
LEFT JOIN 
    TableNameRedacted AS ICLPrev 
ON ICLPrev.IssueKey = IssueChangeLogs.IssueKey AND 
       ICLPrev.ItemToString = IssueChangeLogs.ItemFromString
WHERE
    IssueChangelogs.IssueKey LIKE '%TPP%'

Sample data is (was?) great, but you also need to _specify_ the expected result. — jarlh, Jan 25 '23 at 13:40
If you want to do calculations with values from different rows, you may want to check for function `lag`. — Andrew, Jan 25 '23 at 13:53
Using Tibco TDV to create the query and publish for use in Tableau. The Database is SQL and the application is Jira. . . . The Column "Created" is of the DateTime data type I believe, same with IssueCreatedDate. — Chris McDeed, Jan 25 '23 at 14:11
Actually their data type is reported as TimeStamp by TDV. Just in case that makes a huge difference. :) — Chris McDeed, Jan 25 '23 at 14:41
If you sort issues by `HistoryID`, don't you get the rows in the right order already? Then you just need the `lag` function to get the value from the row above. No need to join with itself. Besides, with your join you are causing multiple repeated joins as the statuses occur several times. `IssueKey` + `ItemFromString` is NOT unique. For the first row (HistoryID 434905), you will get many `ICLPrev` joins even though it is the first one: HistoryIDs 437795 and 439006 are changes to New. BTW, you can remove columns `IssueID` and `IssueCreatedDate` as they just clutter the view and are not needed. — Andrew, Jan 25 '23 at 19:29
To make things more interesting, the table contains other entries for changes to priority and other fields for the issues. So these aren't always perfectly in order, unless I filter those out. Which the join does well in its current state. True, for this example I could leave out IssueID, but its needed for the final solution. IssueCreatedDate is needed to calculate the Duration for the new state. — Chris McDeed, Jan 26 '23 at 15:28
I've worked with lag for a fair amount today in-between other projects and don't think it will work for my case. Well i should say I couldn't get it to work. :) Still trying. While the sample data is pretty straight forward there are often other rows salted in between such as the log for changes in the Issues priority. But absolutely correct, when sorted by date or ID the top most result is the previous state change and contains the date I need. Why I want to just pull the top result from the join. — Chris McDeed, Jan 26 '23 at 20:54
Can't you filter out those queries which have other unrelated changes? About those fields, I suggested removing them just from the question, not your project. :) You can update your question with your current query and output if it's still not working. — Andrew, Jan 27 '23 at 22:00
In theory I could. I could give them multiple tables and split out the priority and other changes so they would still have that data to work with, that would work. My worry is, like everything, things change. Their process may change and include another step that I won't account for or another status will get added that I don't filter out. I'll explore it. Also wanted to explore if I can do multiple queries. One to create the table I am going to return, then another to fill in the last column, where I can lookup a matching row(s) then narrow it down to the previous most recent date. Thank you. — Chris McDeed, Jan 30 '23 at 17:41
Please clarify via edits, not comments. Please delete & flag obsolete comments. Please avoid social & meta content. Please in SQL give every table a tiny mnemonic table alias & dot every column use with its table's alias. — philipxy, Jan 31 '23 at 22:21
Debug questions require a [mre]--cut & paste & runnable code including initialization; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. For SQL include DDL & tabular initialization code. For debug that includes the least code you can give that is code that you show is OK extended by code that you show is not OK. [ask] [Help] When you get a result you don't expect, pause your overall goal, chop to the 1st subexpression with unexpected result & say what you expected & why, justified by documentation. (Debugging fundamental.) — philipxy, Jan 31 '23 at 22:22
[Fetch the row which has the Max value for a column](https://stackoverflow.com/q/121387/3404097) — philipxy, Jan 31 '23 at 22:39
Philipxy, Thanks for cleaning stuff up and removing the human from the post to make it a better technical writing sample. :) Appreciate the look. As for the "Fetch the row which has max value" That is pretty much the question I was asking, as to how to do that in this scenario. One it was possible to have the max value be wrong, but worked out how to filter those out. Then worked out a method to use row_count to identify the top one and select only those. Posted the solution that is working for me now. I am sure there are other methods that might be more efficient. Thank you. — Chris McDeed, Feb 01 '23 at 13:22

score 0 · Answer 1 · edited Jan 31 '23 at 22:26

I used the Row_number function to identify the rows I wanted, then wrap it in a CTE and pull those row numbers, then add one more criteria to the join to rule out oddities.

First we select what we need from the table and then pull out the previous entry related to the state (status) change so we can calculate the time it spent in that state as Duration. So create a Temporary Table (CTE) and order this to use the Row_Number function to identify the rows we want. Then select only those rows from the CTE.

WITH CTE AS (
SELECT
    IssueChangelogs.IssueKey IssueKey, 
    IssueChangelogs.HistoryId HistoryId, 
    IssueChangelogs.IssueId IssueId, 
    IssueChangelogs.AuthorDisplayName AuthorDisplayName, 
    IssueChangelogs.IssueCreatedDate IssueCreatedDate, 
    IssueChangelogs.ItemFromString ItemFromString, 
    IssueChangelogs.ItemToString ItemToString,
    ICLPrev.Created PrevRelatedCreatedDate,
    ICLPrev.HistoryId PrevRelatedHistoryID,
    IssueChangelogs.Created Created,
    CASE 
        -- If the join found a match for a previous status, then we can calculate the Duration it was in that state.
        WHEN ICLPrev.HistoryID IS NOT NULL 
        --  Funky math is so we can get the number of days with 2 decimal points.
            THEN CAST((DATEDIFF(second, ICLPrev.Created, IssueChangeLogs.Created)/86400.00)AS DECIMAL(6,2))
        -- If the state was new then we need to use the IssueCreatedDate as the start date as the default state is New for each issue.
        WHEN  IssueChangeLogs.ItemFromString LIKE '%New%'
        --  Again calculate to get 2 decimal places for duration in days.
            THEN CAST((DATEDIFF(second, IssueChangeLogs.IssueCreatedDate, IssueChangeLogs.Created)/86400.00)AS DECIMAL(6,2))
        -- Else, let's add something easy to identify so when we test and look at the table we know what occured.
        ELSE NULL
        END AS Duration,
    -- Here we are going to assign the number 1 to the rows we want to keep. 
    ROW_NUMBER() OVER(PARTITION BY IssueChangeLogs.IssueKey, IssueChangeLogs.HistoryID, IssueChangeLogs.Created
        -- Since duplicates can exist we only want the most recent match. The join ensures we don't have any start dates greater than the end date and the order by here puts the most recent (DESC) on top so its gets tagged with a 1.
        ORDER BY ICLPrev.Created DESC) AS RowNo
FROM      
    /shared/"Space"/CUI/TandTE/Physical/Metadata/JIRA/CIS/JIRA/IssueChangelogs IssueChangelogs
LEFT JOIN 
    /shared/"Space"/CUI/TandTE/Physical/Metadata/JIRA/CIS/JIRA/IssueChangelogs AS ICLPrev 
    -- These are the closest we can get to a key; however, it will still produce duplicates since an issue can return to a previous state multiple times.
    ON ICLPrev.IssueKey = IssueChangeLogs.IssueKey AND ICLPrev.ItemToString = IssueChangeLogs.ItemFromString
    -- This will remove anything that doesn't make sense in the join, for example a state change where the start date is greater than the end date. Still leaves duplicates which the row number will fix.
    AND ICLPrev.Created < IssueChangeLogs.Created
WHERE
    IssueChangelogs.IssueKey LIKE '%TPP%'
)
-- Now that the CTE table has been created and we have identifed the rows we want from the join with a 1 in RowNo, we filter by those and only pull the columns we want.
SELECT 
    CTE.IssueKey, 
    CTE.HistoryId, 
    CTE.IssueId, 
    CTE.AuthorDisplayName, 
    CTE.IssueCreatedDate, 
    CTE.ItemFromString, 
    CTE.ItemToString,
    CTE.PrevRelatedHistoryID,
    CTE.PrevRelatedCreatedDate,
    CTE.Created,
    CTE.Duration
FROM CTE 
WHERE CTE.RowNo = 1

How do you prevent all the wrong joins like I mentioned in my comment above? I really think you should be using `lag` and avoiding that join. — Andrew, Feb 04 '23 at 03:45
I can see about using lag again, but couldn't get it to work. Sometimes the previous row is unrelated to the current row since its a log it gets many entries. Grouping helped some but not entirely. I can filter out unwanted information but then if the process ever changes then this query has to change to include it. The date filter on the join filters off half of the bad joins, then the row count sort by puts everything else that is undesirable in rows 2+. — Chris McDeed, Feb 07 '23 at 17:21

When Joining a SQL Table to Itself, Return only One Result. Time Log Entry

1 Answers1