0

I have a table like this:

 id  col2   modified date
 1  red     1/7/2019
 1  green   2/7/2019
 1  blue    3/7/2019
 2  green   1/12/2019
 2  blue    3/02/2019
 2  red     4/19/2019
 3  red     12/12/2018
 3  green   02/10/2019

I need to create a query as follows:

Suppose I ran it on April 2019 then the output should look like this

id  col2(first day of every month)  col4
1   1/1/2019                        NULL
1   2/1/2019                        red
1   3/1/2019                        green
1   4/1/2019                        blue
2   1/1/2019                        NULL
2   2/1/2019                        green
2   3/1/2019                         green
2   4/1/2019                         blue
3   1/1/2019                         red
3   2/1/2019                         red
3   3/1/2019                        green
3   4/1/2019                        green

So basically I need to know the value of col2 in the first of every month per ID. for example: for id '1' there is no last value for col2 in table 1 as it's modified on 7th Jan.so the col4 value will be NULL in the second table.but on 1st Feb it will show red as it was the most recent value on that date. Similarly how this logic works for other ids. We need to track back the most recent value of col2 per id on 1st of every month.

I have tried multiple things but couldn't able to take all the scenarios at once.

Dale K
  • 25,246
  • 15
  • 42
  • 71
  • Are you *really* storing your dates as a `varchar`? Those dates don't have a year attached to them, so how do you know that `'12-Jan'` is before or after `'7-Mar'`? They could be from completely different years. – Thom A Aug 30 '19 at 13:55
  • I have made it on excel so that i can show you what i exactly want.You can assume its 2019 apart from one record where i have mentioned its from last year – Gaurav singh Aug 30 '19 at 13:58
  • Assuming really isn't good enough; especially when the process will simply break when 2020 comes along. If your dates don't contain a full date in them they are effectively useless. Don't show us what you're data "looks" like in Excel, provide us with (sample) data that SQL Server has; preferably as DDL and DML statements. – Thom A Aug 30 '19 at 14:01
  • I have updated the question as per you suggestion.So basically logic should work even if the year changes – Gaurav singh Aug 30 '19 at 14:08
  • Will you have more than one record for a particular `id` in a given month? Just wondering why you need the latest, as there is only one in each month in your sample data. – stahamtan Aug 30 '19 at 15:04
  • Yes there can be more than 1 in a particular month – Gaurav singh Aug 30 '19 at 16:39

3 Answers3

1

Rextester DEMO:

There's a lot of unknowns in your question: how do we know to limit for just 4 months? Could there be multiple color changes in a given month and would you want to list each? So I made a few assumptions based on the expected results you define. However I believe you have an error in your expected results for the 4th entry on ID 1. I don't think it would be green; but should be blue. If this assumption is wrong, I can't find any pattern to your expected results.

I think to solve this knowledge of CROSS JOIN and OUTER APPLY are needed. IT may also benefit to know how to use a recursive CTE (CTE = COMMON TABLE EXPRESSION) to get dates within a range; depending on your long term needs; or, as suggested in comments, just have a "dates" table you could pull from.

So for the below:

  • CTE is a table of your data
  • Dates is a table containing the first of every month for the 4 months in question. This dataset could be generated based on your data and I've links of how to do that at the end.
  • CROSS join is used to ensure we get 4 months per ID incase there are gaps in data
  • OUTER APPLY is used to get the most recent color change prior to that records start of month; if such a record exists. we use outer apply because such records may not exist. as exist for entry 1
--CTE is your sample data
with cte (id,  col2,   modifieddate) as 
    (SELECT 1,   'red',  cast('20190107' as date)  UNION ALL
     SELECT 1,   'green',cast('20190207' as Date) UNION ALL
     SELECT 1,   'blue',cast('20190307' as Date) UNION ALL    
     SELECT 2,   'green',cast('20190112' as Date) UNION ALL   
     SELECT 2,   'blue',cast('20190302' as Date) UNION ALL    
     SELECT 2,   'red',cast('20190419' as Date) UNION ALL     
     SELECT 3,   'red',cast('20181212' as Date) UNION ALL     
     SELECT 3,   'green',cast('20190210' as Date)),
-- You didn't define how you know where to start /stop so I just based this on 
-- your results which only went for four months Jan-April of 2019.
  Dates as (SELECT cast('20190101' as date) FirstofMonth  UNION ALL
               SELECT cast('20190201' as date) FirstofMonth  UNION ALL
               SELECT cast('20190301' as date) FirstofMonth  UNION ALL
               SELECT cast('20190401' as date) FirstofMonth )
--This is really the steps needed
--Cross join the dates to your unique ID list so we get 1 date per ID entry  This fills in the missing dates if any exist.
-- Then we use an outer apply to get the most recent color change before that first of month for that ID.  We use a correlated query to only get the most recent color change before the modified date on the record in question.  Thus we have a Top 1 and order by modifed date desc.
     SELECT Z.iD, A.FirstofMonth, Col2 as Col4
     FROM Dates A
     CROSS JOIN (SELECT DISTINCT ID FROM CTE) Z
     OUTER APPLY(SELECT TOP 1 * FROM CTE B
                 WHERE Z.ID = B.ID
                   and B.ModifiedDate<=A.FirstOfMonth
                 ORDER BY B.ModifiedDate desc) X

Giving us:

+----+----+---------------------+-------+
|    | iD |    FirstofMonth     | Col4  |
+----+----+---------------------+-------+
|  1 |  1 | 01.01.2019 00:00:00 | NULL  |
|  2 |  1 | 01.02.2019 00:00:00 | red   |
|  3 |  1 | 01.03.2019 00:00:00 | green |
|  4 |  1 | 01.04.2019 00:00:00 | blue  | <-- I think you have a error in expected results.
|  5 |  2 | 01.01.2019 00:00:00 | NULL  |
|  6 |  2 | 01.02.2019 00:00:00 | green |
|  7 |  2 | 01.03.2019 00:00:00 | green |
|  8 |  2 | 01.04.2019 00:00:00 | blue  |
|  9 |  3 | 01.01.2019 00:00:00 | red   |
| 10 |  3 | 01.02.2019 00:00:00 | red   |
| 11 |  3 | 01.03.2019 00:00:00 | green |
| 12 |  3 | 01.04.2019 00:00:00 | green |
+----+----+---------------------+-------+

Now, you may need a dynamic date generator to get the first of every month between the dates in your results; examples of those can be found on other stack questions such as: Get all dates between two dates in SQL Server

or https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/f648408f-bf91-4f84-8f69-94df8506d4a5/getting-all-months-start-and-end-dates-between-two-dates?forum=transactsql

Both of which use a recursive CTE and a start/end date to generate dates in a range. the first doing all dates, the second just doing the first and last day of the months. I would think the 2nd would suffice for your dates if you use min/max of your base table for your date ranges.

halfer
  • 19,824
  • 17
  • 99
  • 186
xQbert
  • 34,733
  • 2
  • 41
  • 62
  • Rather than a `Dates` CTE, I'd probably just go full on with an entire Calendar Table, and `INNER JOIN` it back to your data with a filter in the `JOIN`. Calendar Tables are immensely useful anyway, and should probably be added to almost every database that you'll be performing analytics on. EDIT: Or `OUTER JOIN` to include the records missing a specific Day1. – Shawn Aug 30 '19 at 15:32
  • No arguments here... it would be easier in the long run and likely faster. But as we don't know the context I just hardcoded the dates for this question. – xQbert Aug 30 '19 at 15:33
  • Sorry guys i didn't put all the information which is required.So basically this query will be used in a report and they can run it anytime and in whichever month they run the report it will show that many rows in the table so in my example its April so there 4 rows,if someone runs it in July it will show 7 records per ID and each row will have the most recent red ,green or blue value on that day of month and you were there is an error in my expected result so i corrected it. – Gaurav singh Aug 30 '19 at 16:34
  • I still think this works. It looks to me like you're after "Current Year" dates only which could be generated based on current date and jan 1 of current year. Just has to be tweeked a bit but the underlying concept is still sound. – xQbert Aug 30 '19 at 18:26
0

You can do a PARTITION BY and get the latest for each id in each month and then marge it with a table that has all combinations of id values and months; m_id table below. Here is a demo

WITH 
data AS
(
    SELECT *, 
        DATEADD(d, 1, EOMONTH(modified_date)) AS FirstOfNextMonth,
        RANK() OVER (
            PARTITION BY id, DATEADD(d, 1, EOMONTH(modified_date))
            ORDER BY modified_date DESC
            ) AS rn
    FROM d
),
m_id AS 
(
    SELECT * 
    FROM y, (SELECT DISTINCT id from d) as p
)

SELECT m_id.id, m_id.FOM, latest.col2 
FROM m_id LEFT JOIN
    (
        SELECT * FROM data
        WHERE rn = 1
    ) AS latest
ON m_id.FOM = latest.FirstOfNextMonth AND m_id.id = latest.id

Which returns below, you can also filter for the months that haven't reached yet (demo).

    id  FOM                 col2
1   1   01.01.2019 00:00:00 NULL
2   1   01.02.2019 00:00:00 red
3   1   01.03.2019 00:00:00 green
4   1   01.04.2019 00:00:00 blue
5   1   01.05.2019 00:00:00 NULL
6   1   01.06.2019 00:00:00 NULL
7   1   01.07.2019 00:00:00 NULL
8   1   01.08.2019 00:00:00 NULL
9   1   01.09.2019 00:00:00 NULL
10  1   01.10.2019 00:00:00 NULL
11  1   01.11.2019 00:00:00 NULL
12  1   01.12.2019 00:00:00 NULL
13  2   01.01.2019 00:00:00 NULL
14  2   01.02.2019 00:00:00 green
15  2   01.03.2019 00:00:00 NULL
16  2   01.04.2019 00:00:00 blue
17  2   01.05.2019 00:00:00 red
18  2   01.06.2019 00:00:00 NULL
19  2   01.07.2019 00:00:00 NULL
20  2   01.08.2019 00:00:00 NULL
21  2   01.09.2019 00:00:00 NULL
22  2   01.10.2019 00:00:00 NULL
23  2   01.11.2019 00:00:00 NULL
24  2   01.12.2019 00:00:00 NULL
25  3   01.01.2019 00:00:00 red
26  3   01.02.2019 00:00:00 NULL
27  3   01.03.2019 00:00:00 green
28  3   01.04.2019 00:00:00 NULL
29  3   01.05.2019 00:00:00 NULL
30  3   01.06.2019 00:00:00 NULL
31  3   01.07.2019 00:00:00 NULL
32  3   01.08.2019 00:00:00 NULL
33  3   01.09.2019 00:00:00 NULL
34  3   01.10.2019 00:00:00 NULL
35  3   01.11.2019 00:00:00 NULL
36  3   01.12.2019 00:00:00 NULL

stahamtan
  • 848
  • 6
  • 10
  • 1
    Pretty sure once a color is established, they wan't the color to persist until changed in a subsequent month. So lines 16, 26,28 look off to me. My first thought was to use LAG to do this; but I thought there could be multiple gaps and ended up with an outer apply to compensate. – xQbert Aug 30 '19 at 16:14
  • This is pretty close to what i want but xQbert is right once the color are established.it will not change until its modified. – Gaurav singh Aug 30 '19 at 16:43
0

This is doable. I used a bunch of sub-selects, and it makes the query a bit long an tedious. It can very likely be simplified a lot more, and I haven't tested performance. I'm not sure which version of SQL you're using, but newer versions should have some functionality that will allow you to simplify also. You'll have to tweak it.

I also added a Date Dimension table to simplify working with dates. As I said above, I'm of the opinion that almost all databases can benefit from a Date Dimension and a Numbers Table. There are a bazillion articles as to why and how, but I've always been a fan of Aaron Bertrand's articles.

SQL Fiddle (see the Fiddle for setup)

Query:

SELECT s5.id, s5.d, s5.col2, s5.col4
FROM (
    SELECT s4.id, s4.d, s4.col2, s4.theDay, s4.theYear
      /* 5. Smear the past data up to the next change. */
      , MAX(s4.col2) OVER (PARTITION BY s4.c1, s4.id) AS col4
    FROM (
        SELECT s1.d, s1.theDay, s1.theYear, s1.id , s2.col2
            /* 4. Identify the records that should be grouped in the window. */
            , COUNT(s2.col2) OVER (ORDER BY s1.id, s1.d) AS c1
        FROM ( 
            /* 1. build the list of days for each id */
            SELECT dd.d, dd.theDay, dd.theYear, s1.id
            FROM datedim dd 
            CROSS APPLY ( SELECT DISTINCT t.id FROM t1 t) s1
        ) s1
        /* 3. JOIN the two together. */
        LEFT OUTER JOIN ( 
            /* 2. Remove dupes from modified records */
            SELECT s3.id, s3.col2, s3.modified
            FROM (
                SELECT t1.id, t1.col2, t1.modified, d1.theMonth AS monthModified
                    /* 2a. Use the ROW_NUMBER() Window Function to number changes in a month. */
                    , ROW_NUMBER() OVER (PARTITION BY t1.id, d1.theYear, d1.theMonth ORDER BY t1.modified DESC) AS rn
                FROM t1
                INNER JOIN datedim d1 ON t1.modified = d1.d
            ) s3
            WHERE s3.rn = 1
        ) s2 ON s1.d = s2.modified
            AND s1.id = s2.id
    ) s4
)s5
/* 6. Filter for only the 1st day of the month. */
WHERE s5.theDay = 1
    AND s5.theYear = year(getDate())
    AND s5.d <= getDate()
/* 6a. Also, if we set a color before 1/1, moving the filter for the date and the year will allow us to carry the color forward from the last time it was set. */
ORDER BY s5.id, s5.d

This Gives You::

| id |          d |   col2 |   col4 |
|----|------------|--------|--------|
|  1 | 2019-01-01 | (null) | (null) |
|  1 | 2019-02-01 | (null) |    red |
|  1 | 2019-03-01 | (null) |  green |
|  1 | 2019-04-01 | (null) |   blue |
|  1 | 2019-05-01 | (null) |   blue |
|  1 | 2019-06-01 | (null) |   blue |
|  1 | 2019-07-01 | (null) |   blue |
|  1 | 2019-08-01 | (null) |   blue |
|  2 | 2019-01-01 | (null) | (null) |
|  2 | 2019-02-01 | (null) |  green |
|  2 | 2019-03-01 | (null) |  green |
|  2 | 2019-04-01 | (null) |   blue |
|  2 | 2019-05-01 | (null) |    red |
|  2 | 2019-06-01 | (null) |    red |
|  2 | 2019-07-01 | (null) |    red |
|  2 | 2019-08-01 | (null) |    red |
|  3 | 2019-01-01 | (null) | yellow |
|  3 | 2019-02-01 | (null) | yellow |
|  3 | 2019-03-01 | (null) |  green |
|  3 | 2019-04-01 | (null) |  green |
|  3 | 2019-05-01 | (null) |  green |
|  3 | 2019-06-01 | (null) |  green |
|  3 | 2019-07-01 | (null) |  green |
|  3 | 2019-08-01 | (null) |  green |
|  4 | 2019-01-01 | (null) | (null) |
|  4 | 2019-02-01 | (null) | (null) |
|  4 | 2019-03-01 | (null) |  green |
|  4 | 2019-04-01 | (null) |  green |
|  4 | 2019-05-01 | (null) |  green |
|  4 | 2019-06-01 | orange | orange |
|  4 | 2019-07-01 | (null) | orange |
|  4 | 2019-08-01 | (null) | orange |

I tried to comment the query so you could follow along with my logic. I also added an additional test case in the changes table to demonstrate how to pick the most recent if 2 or more changes happened in a month. And a second change I added was a check for a color set in the previous year. If that shouldn't be carried forward, then the check for year and date can be moved back up into s1.

Essentially, I used my date table to create a running "calendar" to easily "smear" my change data across the missing days. Then applied those days to each id. Then selected the most recent change and filled in the missing colors. Then selected just the 1st day of each month for each id.

Note that with the Calendar Table / Date Dimension, it becomes easily possible to find the color for the third Tuesday of each month, if you wanted to.

And again, if you have a lot of ids and you're checking this report in December, this could become a lot of data. It will possibly have to be massaged down to a manageable size.

Shawn
  • 4,758
  • 1
  • 20
  • 29