0

I have a table which records the chlorine samples in distribution for a water company at different sites. To start I have the following sub-query:

SELECT CONCAT(MIN(in_dist.chlorine_mgl), ' - ', MAX(in_dist.chlorine_mgl))
FROM gis_user.BACTISURVEY AS in_dist
    INNER JOIN (SELECT MAX(last_edited_date ) AS max_date, collection_location, sample_type, date_sample_collected
    FROM gis_user.BACTISURVEY 
    GROUP BY date_sample_collected, collection_location, sample_type ) AS tbl ON tbl.collection_location = in_dist.collection_location 
    AND tbl.date_sample_collected = in_dist.date_sample_collected
WHERE samples.district_distribution = in_dist.district_distribution

This subquery has a few conditions:

  • So for these sites in my application you have the option to edit the entry, but the database keeps an archive of these edits in the same table, distinguished by 'edit date', so naturally I need to take the latest edit based on datetime (I'm expecting a lot of edits because of the nature of the program)
  • I want to take the minimum chlorine sample value and the maximum chlorine value for a group (This group being district_distribution AND ALSO date_sample_collected)

So basically I want the following:

DISTRICT        CHLORINE        DATE COLLECTED
district1       1.2 - 2.6       March 30th 2020
district1       1.4 - 3.2       April 1st 2020
district2       1.6 - 2.4       March 30th 2020

But I'm getting something like the following:

DISTRICT        CHLORINE        DATE COLLECTED
district1       1.2 - 3.2       March 30th 2020
district1       1.2 - 3.2       April 1st 2020
district2       1.6 - 2.4       March 30th 2020

So what that shows me is that it's taking all the chlorine samples regardless of date within the district_distribution group and taking the minimum and maximum value of those. I want it to differentiate between the collection date as well as the district.

EDIT --

For the below answer, would a case like this also be covered by your query?:

-- Unique Entry 1 (district 1, Type A, AUS, 30th Mar)
    ('district 1', 'type A', 'AUS', '2020-02-30', '2020-04-01', 1.2) --Another row expected min
    ,('district 1', 'type A', 'AUS', '2020-02-30', '2020-04-01', 2.8) --Another row expected max
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-03', 1.1)
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-03', 5.2)
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-05', 1.21) --Expected min
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-05', 1.44) --Expected max

EDIT2: So I used your partition and rank suggestion and that seems to be the trick! I thank you! Marked Answer below.

Kristin Vernon
  • 155
  • 1
  • 2
  • 20
  • 1
    I'm lost. You show a quite complicated query, but I'm not sure what that has to do with the question. And, I don't know what your data looks like either. – Gordon Linoff Apr 14 '20 at 23:12

1 Answers1

0

Answer at dbfiddle.uk

Interpretation of Question

  • Have a table which has data with repeated labels due to archival entries, which are distinguished with an edit timestamp.
  • Want to find Max/Min range for each distinct label (District, Date collected, type, etc.) that is the most up-to-date data, identified by the last_edit_date column

Table used

declare @bactisurvey table (
    district_distribution varchar(200)
    ,sample_type varchar(100)
    ,collection_location varchar(100)
    ,date_sample_collected datetime
    ,last_edited_date datetime
    ,chlorine_mgl float
);

insert into @bactisurvey 
values
    -- Unique Entry 1 (district 1, Type A, AUS, 30th Mar)
    ('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-01', 1.2)
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-01', 2.8)
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-03', 1.1)
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-03', 5.2)
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-05', 1.21) --Expected min
    ,('district 1', 'type A', 'AUS', '2020-03-30', '2020-04-05', 1.44) --Expected max

    -- Unique Entry 2 (district 2, Type B, AUS, 30th Mar)
    ,('district 2', 'type B', 'AUS', '2020-03-30', '2020-04-01', 1.2)
    ,('district 2', 'type B', 'AUS', '2020-03-30', '2020-04-01', 2.8)
    ,('district 2', 'type B', 'AUS', '2020-03-30', '2020-04-05', 1.11) --Expected min
    ,('district 2', 'type B', 'AUS', '2020-03-30', '2020-04-05', 5.21) --Expected max
    ,('district 2', 'type B', 'AUS', '2020-03-30', '2020-04-05', 1.2)
    ,('district 2', 'type B', 'AUS', '2020-03-30', '2020-04-05', 1.4);

The data shows two distinct entries, that are edited over multiple days. I assume the edits will provide two values (a new minimum and a new maximum, see Caveats).

E.g. for the first unique entry:

  • On April 1st the original range is 1.2-2.8.
  • It is then edited on April 3rd with a new range of 1.1-5.2
  • Finally on April 5th, the new range is 1.21-1.44 (purposely higher min and lower max to show we are getting latest data)

Query

;with TopEdits as (
    select 
        edit_num = rank() over (partition by district_distribution, sample_type, collection_location, date_sample_collected order by last_edited_date desc)
        ,district_distribution
        ,sample_type
        ,collection_location
        ,date_sample_collected
        ,last_edited_date
        ,chlorine_mgl
    from @bactisurvey
)
select
    district_distribution
    ,sample_type
    ,collection_location
    ,date_sample_collected
    ,min_chlorine = min(chlorine_mgl)
    ,max_chrloine = max(chlorine_mgl)
from TopEdits
where
    edit_num = 1
group by
    district_distribution
    ,sample_type
    ,collection_location
    ,date_sample_collected

Results & Explaination

+-----------------------+-------------+---------------------+-------------------------+--------------+--------------+
| district_distribution | sample_type | collection_location | date_sample_collected   | min_chlorine | max_chlorine |
+-----------------------+-------------+---------------------+-------------------------+--------------+--------------+
| district 1            | type A      | AUS                 | 2020-03-30 00:00:00.000 | 1.21         | 1.44         |
+-----------------------+-------------+---------------------+-------------------------+--------------+--------------+
| district 2            | type B      | AUS                 | 2020-03-30 00:00:00.000 | 1.11         | 5.21         |
+-----------------------+-------------+---------------------+-------------------------+--------------+--------------+
  • The CTE TopEdits is effectively the same as the @bactisurvey table but with an additional edit_num column.
  • This column orders the rows with the same district_distribution, sample_type, collection_location and date_sample_collected with a simple integer, with the latest edits having edit_num=1.
  • See this Stack Overflow answer for more on partition by
  • When we SELECT the TopEdits will filter with where edit_num = 1 to get the top edits for each unique (district_distribution, sample_type, collection_location, date_sample_collected) and find the max/min between these rows.
  • N.B I assume multiple edits will occur for each unique label to reflect updated minimums and maximums.

Caveats

I'm not sure how you will tell apart the 'latest minimum' and 'latest maximum' value. For example, consider two entries:

  • Entry for 2020-03 value 1.5 edited Today
  • Entry for 2020-03 value 1.6 edited Yesterday

Is the new 1.5 value displacing the previous maximum of 1.6, or will it be the new 'minumum'?

I assumed in my answer that we actually get two edits for the exact same last_edit_date, with one edit being the new minimum and one edit being the new maximum.

If this is not the case, then you could consider using row_number() instead of rank() as row_number() will increment our edit_num column regardless of duplicate rows or not. Then you can filter with edit_num <= 2 to get the top 2 rows.

Jamie Phan
  • 1,112
  • 7
  • 15
  • This `rank()` and `row_number()` does seem like it would help but your interpretation of my question is not completely right. See, you're right that I am trying to select from a pool of labels which are only the latest edits, however, date_sample_collected for district1 and district2 are collected month to month (I'm trying to find the min and max chlorine grouped by the date they are collected). Also to answer your question for the caveat, if a value of 1.6 for sample date 2020-03 is changed to 1.5, and 1.6 was the original minimum value then 1.5 would become the new minimum value. – Kristin Vernon Apr 15 '20 at 15:18
  • See EDIT in original question for more clarification – Kristin Vernon Apr 15 '20 at 15:27