0

I've been searching many similar posts here but they seem not to meet my requirement to use SQL Server IDENTITY.

I have a dataframe extracted from another data source which might include duplicate Name.

Dataframe

Date Time Name Result
3/28/2022 7:00:00 abc 0.23
3/28/2022 7:00:00 abc 0.19
3/28/2022 7:05:00 def 0.36
3/28/2022 7:10:00 ghi 0.29

When I load to SQL Server, I need to create a Retest_Count column in order to indicate the test order. If the Name is duplicated and same timestamp, always set the newer one to be 0 and the older one to be 1 based on the 'original data order'. Like this:

SQL Table 1

Date Time Name Result Retest_Count
3/28/2022 7:00:00 abc 0.23 1
3/28/2022 7:00:00 abc 0.19 0
3/28/2022 7:05:00 def 0.36 0
3/28/2022 7:10:00 ghi 0.29 0

When I insert new data with same Name but different timestamp, always give 0 to the new data and the existing table need to get auto increment like this:

SQL Table 2

Date Time Name Result Retest_Count
3/28/2022 7:00:00 abc 0.23 2
3/28/2022 7:00:00 abc 0.19 1
3/29/2022 13:00:00 abc 0.18 0
3/28/2022 7:05:00 def 0.36 0
3/28/2022 7:10:00 ghi 0.29 1
3/29/2022 21:05:00 ghi 0.25 0

Is this possible to achieve by SQL Server script? Python solution is considered also.

Kelvin Lo
  • 189
  • 1
  • 11
  • This would seem better as a column in a `VIEW`, rather than a persisted value. You would still need an `IDENTITY` on your table, however, to denote the order of the data (as i assume it's coincidence that `Result` is in ascending order). – Thom A Mar 28 '22 at 16:10
  • The purpose for this additional column is when I set Retest_Count = '0', I can always refer to the most updated data and filter out the old one. – Kelvin Lo Mar 28 '22 at 16:14
  • You don't need a column like `Retest_Count` to achieve that, you could just use a [Get top 1 row of each group](https://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group) (which, again, you could use a `VIEW` for easy use). This smells like an [XY Problem](//xyproblem.info). – Thom A Mar 28 '22 at 16:16
  • Thank you! "Get top 1 row of each group" seems like a solution but if a duplicate data with same timestamp in the table, perhaps it couldn't identify. I'll still try it out – Kelvin Lo Mar 28 '22 at 16:24

1 Answers1

1

As I mentioned in the comments, this is far better suited for a VIEW. You will need to ensure you have some kind of column in your table that always increases (such as an IDENTITY) to achieve this though (as I note you have 2 tests that occur at the same time for the same Name, so there isn't a way to denote which is "first").

Then you can create a VIEW like this:

CREATE VIEW dbo.YourView_RetestCounts AS
    SELECT [Date],
           [Time],
           [Name],
           Result,
           ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY IdentityColumn DESC) - 1 AS Retest_Count
    FROM dbo.YourTable;

If, however, your goal is to filter to the latest row, as your comments suggest, then use a VIEW with a Get top 1 row of each group solution:

CREATE VIEW dbo.YourView_LatestTest AS
    WITH CTE AS(
        SELECT [Date],
               [Time],
               [Name],
               Result,
               ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY IdentityColumn DESC) AS Retest_Count
        FROM dbo.YourTable)
    SELECT [Date],
           [Time],
           [Name],
           Result
    FROM CTE
    WHERE Retest_Count = 1;
Thom A
  • 88,727
  • 11
  • 45
  • 75
  • When I used `ORDER BY [Date], [Time] DESC`, neither one gives me the reverse order. The latest record always has larger Retest_Count. Do you know why? – Kelvin Lo Mar 29 '22 at 07:09
  • Because you're ordering by date **ascending** @KelvinLo . – Thom A Mar 29 '22 at 07:41