6

Say I have a table:

CREATE TABLE T
(
    TableDTM  TIMESTAMP  NOT NULL,
    Code      INT        NOT NULL
);

And I insert some rows:

INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:00:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:10:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:20:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:40:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:50:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:00:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:10:00', 1);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:20:00', 0);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:30:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:40:00', 5);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:50:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:00:00', 3);
INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:10:00', 3);

So I end up with a table similar to:

2011-01-13 10:00:00, 5
2011-01-13 10:10:00, 5
2011-01-13 10:20:00, 5
2011-01-13 10:30:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:00:00, 1
2011-01-13 11:10:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:40:00, 5
2011-01-13 11:50:00, 3
2011-01-13 12:00:00, 3
2011-01-13 12:10:00, 3

How can I select the first date of each set of identical numbers, so I end up with this:

2011-01-13 10:00:00, 5
2011-01-13 10:40:00, 0
2011-01-13 10:50:00, 1
2011-01-13 11:20:00, 0
2011-01-13 11:30:00, 5
2011-01-13 11:50:00, 3

I've been messing about with subqueries and the like for most of the day, and for some reason I can't seem to crack it. I'm sure there's a simple way somewhere!

I would probably want to exclude the 0s from the results, but that's not important for now.

Mark

5 Answers

4

Revised 15 Jan 11

I'm sure there's a simple way somewhere!

Yes, there is. But first, two Issues.

  1. The table is not a Relational Database table. It does not have a unique key, which is demanded by the RM and Normalisation (specifically, that each row must have a unique identifier; not necessarily a PK). Therefore SQL, a standard language for operating on Relational Database tables, cannot perform basic operations on it.

    • it is a Heap (data structure, inserted and deleted in chronological order), with records, not rows
    • any and all operations using SQL will be horribly slow, and will not be correct
    • SET ROWCOUNT to 1, perform row processing, and SQL will work on the Heap just fine
    • your best bet is to use any Unix utility to operate on it (awk, cut, chop). They are blindingly fast. The awk script required to answer your requirement would take 3 mins to write, and it will run in seconds for millions of records (I wrote a few last week).

    So the question really is SQL to find the first occurrence of sets of data in a non-relational Heap.

    Now if your question was SQL to find the first occurrence of sets of data in a Relational table, implying of course some unique row identifier, that would be (a) easy in SQL, and (b) fast in any flavour of SQL ...

    • except Oracle, which is known to handle subqueries badly (see specifically Tony Andrews' comments; he is a well-known authority on Oracle). In which case, use Materialised Views.
  2. The question is very generic (no complaint). But many of these specific needs are usually applied within a larger context, and the context has requirements which are absent from the specification here. Generally the need is for a simple Subquery (but in Oracle use a Materialised View to avoid the subquery). And the subquery, too, depends on the outer context, the outer query. Therefore the answer to the small generic question will not contain the answer to the actual specific need.


Anyway, I do not wish to avoid the question. Why don't we use a real-world example, rather than a simple generic one, and find the first or last occurrence, or minimum or maximum value, of a set of data, within another set of data, in a Relational table?

Main Query

Let's use the ▶Data Model◀ from your previous question.

Report all Alerts since a certain date that are not Acknowledged, with the peak Value for the duration

Since you will be using exactly the same technique (with different table and column names) for all your temporal and History requirements, you need to fully understand the basic construct of a Subquery, and its different applications.

Introduction

Note that you have not only a pure 5NF Database with Relational Identifiers (composite keys); you have full Temporal capability throughout, and the temporal requirement is rendered without breaking 5NF (No Update Anomalies), which means the ValidToDateTime for periods and durations is derived, not duplicated in data. The point is, that complicates things; hence this is not the best example for a tutorial on Subqueries.

  • Remember the SQL engine is a set-processor, so we approach the problem with a set-oriented mindset
    • do not dumb the engine down to row-processing; that is very slow
    • and more important, unnecessary
  • Subqueries are normal SQL. The syntax I am using is straight ISO/IEC/ANSI SQL.
    • if you cannot code subqueries in SQL, you will be very limited; and then need to introduce data duplication or use large result sets as Materialised Views or temporary tables or all manner of additional data and additional processing, which will be s.l.o.w to v.e.r.y s.l.o.w, not to mention completely unnecessary
    • if there is anything you cannot do in a truly Relational Database (and my Data Models always are) without switching to row-processing or inline views or temp tables, ask for help, which is what you have done here.
  • You need to fully understand the first Subquery (simpler) before attempting to understand the second; etc.

Method

First build the Outer query using minimum joins, etc, based on the structure of the result set that you need, and nothing more. It is very important that the structure of the outer query is resolved first; otherwise you will go back and forth trying to make the subquery fit the outer query, and vice versa.

  • That happens to require a Subquery as well. So leave that part out for now, and pick it up later. For now, the Outer query gets all Alerts after a certain date (not just the un-acknowledged ones)

The ▶SQL code◀ required is on page 1 (sorry, the SO edit features are horrible; they destroy the formatting, and the code is already formatted).

Then build the Subquery to fill each cell.

Subquery (1) Derive Alert.Value

That is a simple derived datum: select the Value from the Reading that generated the Alert. The tables are related, the cardinality is 1::1, so it is a straight join on the PK.

  • The type of Subquery required here is a Correlated Subquery, we need to correlate a table in the Outer query to a table in the (inner) Subquery.
    • in order to do that, we need an Alias for the table in the Outer query, to correlate it to a table in the Subquery.
    • to make the distinction, I have used aliases only for such required correlation, and fully qualified names for plain joins
  • Subqueries are very fast in any engine (except Oracle)
  • SQL is a cumbersome language. But that's all we have. So get used to it.

The ▶SQL code◀ required is on page 2.

I have purposely given you a mix of joins in the Outer Query vs obtaining data via Subquery, so that you can learn (you could alternatively obtain Alert.Value via a join, but that would be even more cumbersome).
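
Since the linked code is external, here is a minimal sketch of the shape of such a Correlated Scalar Subquery. The names used (Alert, Reading, SensorCode, ReadingDtm, Value) are assumptions based on the Data Model under discussion; the real keys are composite Relational Identifiers:

SELECT  A.SensorCode,
        A.ReadingDtm,
        ( SELECT R.Value                            -- Correlated Scalar Subquery: one Value per outer row
              FROM  Reading R
              WHERE R.SensorCode = A.SensorCode     -- correlate the inner table to the Outer alias A
              AND   R.ReadingDtm = A.ReadingDtm     -- 1::1 cardinality: a straight join on the PK
        ) AS Value
    FROM Alert A;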

The next Subquery we need derives Alert.PeakValue. For that we need to determine the Temporal Duration of the Alert. We have the beginning of the Alert Duration; we need to determine the end of the Duration, which is the next (temporally) Reading.Value that is within range. That requires a Subquery as well, which we had better handle first.

  • Work the logic from the inside, outward. Good old BODMAS.

Subquery (2) Derive Alert.EndDtm

A slightly more complex Subquery: select the first Reading.ReadingDtm that is greater than or equal to the Alert.ReadingDtm, and that has a Reading.Value which is less than or equal to its Sensor.UpperLimit.

Handling 5NF Temporal Data

For handling temporal requirements in a 5NF Database (in which EndDateTime is not stored, as that would be duplicate data), we work on a StartDateTime only, and the EndDateTime is derived: it is the next StartDateTime. This is the Temporal notion of Duration.

  • Technically, it is one millisecond (whatever the resolution of the Datatype used) less.
  • However, in order to be reasonable, we can speak of, and report, EndDateTime as simply the Next.StartDateTime, and ignore the one millisecond issue.
  • The code should always use >= This.StartDateTime and < Next.StartDateTime.
    • That eliminates a slew of avoidable bugs
    • Note that these comparison operators, which bracket the Temporal Duration, and should be used in a conventional manner throughout as per above, are quite independent of similar comparison operators related to business logic, eg. Sensor.UpperLimit (ie. watch for it, because both are often located in one WHERE clause, and it is easy to mix them up or get confused).

The ▶SQL code◀ required, along with test data used, is on page 3.
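
Again, the real code is in the linked document; as a hedged sketch only, under the same assumed names (plus Sensor.UpperLimit), the EndDtm derivation selects the first in-range Reading at or after the Alert's start, using the >= / < convention above:

SELECT  A.SensorCode,
        A.ReadingDtm AS StartDtm,
        ( SELECT MIN(R.ReadingDtm)                  -- the first Reading at or after the Alert start...
              FROM  Reading R
              JOIN  Sensor  S ON S.SensorCode = R.SensorCode
              WHERE R.SensorCode =  A.SensorCode    -- Correlated on the Alert's Identifier
              AND   R.ReadingDtm >= A.ReadingDtm    -- >= This.StartDateTime
              AND   R.Value      <= S.UpperLimit    -- ...whose Value is back within range
        ) AS EndDtm
    FROM Alert A;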

Subquery (3) Derive Alert.PeakValue

Now it is easy. Select the MAX(Value) from Readings between Alert.ReadingDtm and Alert.EndDtm, the duration of the Alert.

The ▶SQL code◀ required is on page 4.
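
A sketch of the Subquery body under the same assumptions, with A.EndDtm standing in for the full derivation from Subquery (2), which would be nested in its place in real code:

( SELECT MAX(R.Value)                               -- the peak over the Alert's Duration
      FROM  Reading R
      WHERE R.SensorCode =  A.SensorCode            -- Correlated to the Outer Alert alias A
      AND   R.ReadingDtm >= A.ReadingDtm            -- >= This.StartDateTime
      AND   R.ReadingDtm <  A.EndDtm                -- <  Next.StartDateTime (from Subquery 2)
) AS PeakValue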

Scalar Subquery

In addition to being Correlated Subqueries, the above are all Scalar Subqueries, as they return a single value; each cell in the grid can be filled with only one value. (Non-Scalar Subqueries, that return multiple values, are quite legal, but not for the above.)

Subquery (4) Acknowledged Alerts

Ok, now that you have a handle on the above Correlated Scalar Subqueries, those that fill cells in a set, a set that is defined by the Outer query, let's look at a Subquery that can be used to constrain the Outer query. We do not really want all Alerts (above), we want Un-Acknowledged Alerts: the Identifiers that exist in Alert, that do not exist in Acknowledgement. That is not filling cells, that is changing the content of the Outer set. Of course, that means changing the WHERE clause.

  • We are not changing the structure of the Outer set, so there is no change to the FROM and existing WHERE clauses.

Simply add a WHERE condition to exclude the set of Acknowledged Alerts. 1::1 cardinality, straight Correlated join.

The ▶SQL code◀ required is on page 5.

The difference is, this is a non-Scalar Subquery, producing a set of rows (one column). We have an entire set of Alerts (the Outer set) matched against an entire set of Acknowledgements.

  • The matching is processed because we have told the engine that the Subquery is Correlated, by using an alias (no need for cumbersome joins to be identified)
  • Use SELECT 1, because we are performing an existence check (see the sketch after this list). Visualise it as a column added onto the Alert set defined by the Outer query.
  • Never use SELECT *, because we do not need the entire set of columns, and that will be slower
  • Likewise, failing to use a correlation means a WHERE NOT IN () is required; but again, that constructs the defined column set, then compares the two sets. Much slower.
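
A sketch of the constrained Outer query under the same assumed names (the Acknowledgement key columns are assumptions):

SELECT  A.SensorCode,
        A.ReadingDtm
    FROM  Alert A
    WHERE A.ReadingDtm >= '2011-01-01'              -- the existing Outer WHERE, unchanged
    AND   NOT EXISTS (                              -- exclude the set of Acknowledged Alerts
        SELECT 1                                    -- existence check only: never SELECT *
            FROM  Acknowledgement K
            WHERE K.SensorCode = A.SensorCode       -- Correlated on the Alert's Identifier
            AND   K.ReadingDtm = A.ReadingDtm
        );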

Subquery (5) Actioned Alerts

As an alternative constraint on the Outer query, for un-actioned Alerts, instead of (4), exclude the set of Actioned Alerts. Straight Correlated join.

The ▶SQL code◀ required is on page 5.

This code has been tested on Sybase ASE 15.0.3 using 1000 Alerts and 200 Acknowledgements, of different combinations; and the Readings and Alerts identified in the document. Zero milliseconds execution time (0.003 second resolution) for all executions.

If you need it, here is the ▶SQL Code in Text Format◀.

Response to Comments

(6) ▶Register Alert from Reading◀
This code executes in a loop (provided), selecting new Readings which are out-of-range, and creating Alerts, except where applicable Alerts already exist.

(7) ▶Load Alert From Reading◀
Given that you have a full set of test data for Reading, this code uses a modified form of (6) to load the applicable Alerts.

Common Problem

It is "simple" when you know how. I repeat, writing SQL without the ability to write Subqueries is very limiting; it is essential for handling Relational Databases, which is what SQL was designed for.

  • Half the reason developers implement unnormalised data heaps (massive data duplication) is because they cannot write the subqueries required for Normalised structures
    • it is not that they have "denormalised for performance"; it is that they cannot code for Normalised. I have seen it a hundred times.
    • Case in point here: you have a fully Normalised Relational Database, and the difficulty is coding for it, and you were contemplating duplicating tables for processing purposes.
  • And that is not counting the added complexity of a temporal database; or a 5NF temporal database.
  • Normalisation means Never Duplicate Anything, more recently known as Don't Repeat Yourself
  • Master Subqueries and you will be in the 98th percentile: Normalised, true Relational Databases; zero data duplication; very high performance.

I think you can figure out the remaining queries you have.

Relational Identifier

Note, this example also happens to demonstrate the power of using Relational Identifiers, in that several tables in-between the ones we want do not have to be joined (yes! the truth is that Relational Identifiers mean fewer, not more, joins than Id keys). Simply follow the solid lines.

  • Your temporal requirement demands keys containing DateTime. Imagine trying to code the above with Id PKs, there would be two levels of processing: one for the joins (and there would be far more of them), and another for the data processing.

Label

I try to stay away from colloquial labels ("nested", "inner", etc) because they are not specific, and stick to specific technical terms. For completeness and understanding:

  • a Subquery in the FROM clause is a Materialised View: a result set derived in one query and then fed into the FROM clause of another query, as a "table" (a sketch of this form follows the example below).
    • The Oracle types call this Inline View.
    • In most cases, you can write Correlated Subqueries as Materialised Views, but that is massively more I/O and processing (since Oracle's handling of subqueries is abysmal, for Oracle only, Materialised Views are "faster").
  • A Subquery in the WHERE clause is a Predicate Subquery, because it changes the content of the result set (that which it is predicated upon). It can return either a Scalar (one value) or non-Scalar (many values).

    • for Scalars, use WHERE column =, or any scalar operator

    • for non-Scalars, use WHERE [NOT] EXISTS, or WHERE column [NOT] IN

  • A Subquery in the WHERE clause does not need to be Correlated; the following works just fine. Identify all superfluous appendages:

    SELECT  [Never] = FirstName,
            [Acted] = LastName
        FROM User
        WHERE UserId NOT IN ( SELECT DISTINCT UserId
                                  FROM Action
                            )
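
For contrast, here is a sketch (not from the original answer) of that same exclusion written as a Subquery in the FROM clause, i.e. the Materialised (Inline) View form described above, using the same assumed tables:

    SELECT  U.FirstName,
            U.LastName
        FROM User U
        LEFT JOIN ( SELECT DISTINCT UserId      -- the view is derived first, then joined as a "table"
                        FROM Action
                  ) A ON A.UserId = U.UserId
        WHERE A.UserId IS NULL                  -- keep only Users with no Action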

PerformanceDBA
  • @PerformanceDBA: It's actually for a table you might be familiar with :-) I (naively) left out some columns to focus in on what I was trying to do - however I probably shouldn't have done that. It's the reading table from our previous question. I'm trying to populate the alert table from it using a query. It's late here so I'm off to bed; I'll come back to this tomorrow and try to clear up the question. PS - thanks for that response, I'll re-read it tomorrow when I'm awake as it looks like there's lots of useful info there. – Mark Jan 14 '11 at 00:35
  • @Mark. I know that. What you left out is critical. Remember how we started? Couldn't progress, due to `Id` stuck on everything. All problems solved in that first iteration by using appropriate **Identifiers**. Same again: the Identifiers are critical, to the code as well; crippled without them. All explained, in full tutorial form, so that you really understand how to write subqueries and fulfill all your needs. Read carefully, after coffee. – PerformanceDBA Jan 14 '11 at 09:41
  • @Mark. Populate `Alert` from `ReadingValue` 1) ensure you understand the above Answer. 2) look at the early code for that, which is 3/4 into [this answer](http://stackoverflow.com/questions/4335189/opinions-on-sensor-reading-alert-database-design/4354034#4354034), confirm the context is correct. 3) Extend that code (just add Predicate Subqueries, Scalars). It is easy, took 15 mins, incl testing. But first understand the above Answer, and get some experience with Subqueries. – PerformanceDBA Jan 14 '11 at 09:52
  • @PerformanceDBA: Thanks - all great info as usual. The queries you are showing are with the alerts table populated. At the moment I have a load of test data in the Readings table, but nothing in the Alerts table. I need to populate the Alerts table from the Readings table, and I figured it should be possible as the Alerts can be derived from the Readings. Granted this isn't normal operation of the system - it's just to make the test data complete. I could do it in code, but practising querying isn't a bad thing :-) – Mark Jan 14 '11 at 10:56
  • @PerformanceDBA: Sorry only just saw your next comment after writing mine! – Mark Jan 14 '11 at 10:57
  • @PerformanceDBA: I'll keep at it - the example you mention in the other question requires a loop doesn't it? Unless I'm misunderstanding.. – Mark Jan 14 '11 at 11:03
  • @Mark. In Sybase it is a cinch: we have a switch to "allow dupe row" on the Index, which prints a warning instead of aborting. So we can use the brute-force technique INSERT Alert SELECT FROM Reading WHERE Value > UpperLimit, and load it in one command. No idea re PostgreSQL. I could give you the [**real code**](http://www.softwaregems.com.au/Documents/Student%20Resolutions/Mark%20110113_4.sql) (as per Comment 3) and you could write the required loop, modified so that it iterates through **every** time cycle, rather than just the latest time cycle. – PerformanceDBA Jan 14 '11 at 11:13
  • @Mark. Yes, you have to write the loop. But you need that for the app anyway. Then use a modified form (one additional condition in the WHERE clause) to load Alert test data from Reading test data. – PerformanceDBA Jan 14 '11 at 11:22
  • @PerformanceDBA: That's okay, I'll try and figure it out, it'll do me good to experiment for a while! For this particular question with the cut down table I provided, vc 74's answer is correct so strictly speaking I should mark that as the answer. – Mark Jan 14 '11 at 11:24
  • @PerformanceDBA: Just noticed the "real code" link - thanks :-) – Mark Jan 14 '11 at 11:35
  • @Mark. No worries, my answer was way beyond your question; I am not here for the points. Notice it again: I wrote the loop. You still need the vectors. And here is the **[modified loop](http://www.softwaregems.com.au/Documents/Student%20Resolutions/Mark%20110113_5.sql)** to load Alerts from Reading test data. – PerformanceDBA Jan 14 '11 at 12:25
  • @PerformanceDBA: Thanks for that :-) I'm a little confused by the query that fetches alerts with each alert's peak value. In the subquery, the select statement fetches all readings whose ReadingDTM is greater than or equal to the start date (fine) and whose value is less than or equal to the upper limit. Should it be whose value is greater than the upper limit? Also - to me that looks like it'd include *all* readings after the alert start date, including readings not within that alert's lifetime (i.e. it looks like it would include readings possibly occurring after a *later* alert). – Mark Jan 14 '11 at 16:16
  • @Mark. Not quite. You're right, I forgot one subquery when working between the SQL and the Answer, but that needs another two explanations. So I revised the Answer completely; tested further; updated the files. The links are now all in the Answer (those in the comments are no longer valid). – PerformanceDBA Jan 15 '11 at 06:14
  • @PerformanceDBA: Thanks! That all looks great - and I think I even understand most of it now. I'm away for the weekend for my mum's birthday so it'll probably be Monday before I get the chance to really work through it. Thanks again and have a good weekend :-) – Mark Jan 15 '11 at 09:17
  • @Mark. I made two tiny corrections to the doc, to ensure the code is applicable to the **generic** case. – PerformanceDBA Jan 19 '11 at 22:18
  • @PerformanceDBA: quick question for you regarding the alert queries - would you say it's a good idea to stick them in "views" so I don't have to inline the large amounts of sql in the C++ code? For example, I could create a view called SensorAlertInfo which contains the EndDTM, PeakValue, etc.. – Mark Jan 27 '11 at 15:49
  • @Mark. 1) Good idea, and a classic use of Views. I might call it `Alert_V` though; it could get the current or most recent `Alert` for the given `Sensor` . 2) After development, all SQL should be in stored procs: much faster because the SQL does not have to be compiled on every call; smaller code blocks in the calling app. – PerformanceDBA Jan 27 '11 at 22:25
  • @PerformanceDBA: I've posted a new question discussing database history in general as it's a topic I find interesting. I'm especially interested in your thoughts on the Snodgrass book. I'm NOT looking to revisit the schema you put together for me, I'm extremely happy with that and currently working with it. – Mark Feb 01 '11 at 23:40
  • @Mark. Great that the DM is complete. Can't find the question. Link please. – PerformanceDBA Feb 02 '11 at 11:42
  • @PerformanceDBA: I deleted it as nobody seemed interested - I thought perhaps I had asked too many questions on the subject already! I could repost though, as I wrote it out separately first. – Mark Feb 02 '11 at 14:03
  • @Mark. SO is a site for simple questions and quick answers. The simple responders either give wrong answers to big questions, or none at all. They've deleted a few questions that had good answers because "it wasn't a question", and shut others down. You are free to go to my profile page and email me. – PerformanceDBA Feb 02 '11 at 22:56
1

PostgreSQL supports window functions; have a look at this.

[EDIT] Try the following:

SELECT TableDTM, Code FROM
(
    SELECT TableDTM,
           Code,
           LAG(Code, 1, NULL) OVER (ORDER BY TableDTM) AS PrevCode
    FROM   T
) AS sub  -- PostgreSQL requires an alias on a subquery in FROM
WHERE PrevCode <> Code OR PrevCode IS NULL;
vc 74
  • Nope :-( It misses out the first set of 5's. Close, but no cigar. – Mark Jan 13 '11 at 17:02
  • Try changing the final WHERE clause to "WHERE PrevCode<>Code OR PrevCode IS NULL" – nvogel Jan 13 '11 at 17:28
  • @dportas Thanks, I've updated the query accordingly (I can't test as I don't have access to a PostgreSQL instance) – vc 74 Jan 14 '11 at 10:40
  • `WHERE PrevCode<>Code OR PrevCode IS NULL;` could be converted to `PrevCode IS DISTINCT FROM Code`. You're using Postgres anyway :-) – Michael Buen Jan 14 '11 at 11:09
  • Thanks, for the table I provided that works a treat :-) Now I just have to understand why. – Mark Jan 14 '11 at 11:25
  • @Mark in short, LAG retrieves a field value from a different row, in this case from the previous row when the data is sorted by date (as specified in the over(order by TableDTM) clause). If PrevCode is different from Code, it means we are starting a new group and are actually on the first row in that group, which are precisely the rows you want to identify. – vc 74 Jan 14 '11 at 12:16
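
Folding in Michael Buen's IS DISTINCT FROM suggestion and the optional zero-exclusion from the question, a sketch of the combined PostgreSQL query:

SELECT TableDTM, Code FROM
(
    SELECT TableDTM,
           Code,
           LAG(Code) OVER (ORDER BY TableDTM) AS PrevCode
    FROM   T
) AS runs
WHERE PrevCode IS DISTINCT FROM Code  -- NULL-safe: also true for the very first row
AND   Code <> 0;                      -- the optional zero-exclusion
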
1

Try this:

SELECT MIN(TableDTM) TableDTM, Code
FROM
(
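    -- For each row, find the time of the next reading with a different Code:
    -- every row in the same run shares this boundary (NULL for the final run)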
    SELECT T1.TableDTM, T1.Code, MIN(T2.TableDTM) XTableDTM
    FROM T T1
    LEFT JOIN T T2
    ON T1.TableDTM <= T2.TableDTM
    AND T1.Code <> T2.Code
    GROUP BY T1.TableDTM, T1.Code
) X
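-- each (boundary, Code) pair is one run; its earliest TableDTM is the run's start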
GROUP BY XTableDTM, Code
ORDER BY 1;
nvogel
  • That works! It is tremendously slow on a large table (over 50000 rows) - though that might be down to indexing, I'll see if I can fix that. – Mark Jan 13 '11 at 17:25
0

Could you try something like:

SELECT DISTINCT Code, (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) AS TableDTM FROM T;

and if you need to exclude the 0, change it to:

SELECT DISTINCT Code, (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) AS TableDTM FROM T WHERE Code <> 0;
Ass3mbler
  • If that works I'm going to feel like a right prat :-) But a happy one. Thanks, going to have a go now - looks like it will work though. – Mark Jan 13 '11 at 16:17
  • It's really *really* slow on a table with 51425 rows, however that's probably an index issue. I'll see if I can sort it out. – Mark Jan 13 '11 at 16:25
  • It's insisting on doing a very expensive sequential scan on the table; I'm just trying to get my head round the output from EXPLAIN ANALYZE (in PostgreSQL) to work out why. – Mark Jan 13 '11 at 16:54
  • Actually there's no point, the results are wrong :-( I get back only 4 rows using the exact example in the question - so looks like a non starter. – Mark Jan 13 '11 at 16:57
  • @Mark I'm sorry, I misread your question; I was thinking that you only needed to get the first rows – Ass3mbler Jan 13 '11 at 18:46
  • @Mark. Without an index, for Heaps, a table scan is the only option. That is the reason all the responses will be slow. – PerformanceDBA Jan 14 '11 at 13:52
0

Maybe I don't understand the question, but I don't see any mention of Common Table Expressions or Analytic Functions. These are my weapons of choice for most problems, and when they can't handle it I start resorting to temporary tables.

I recently solved a similar problem, where I wanted to get the data of the first occurrence of an error when processing a daily interface file. Records on the interface that cause a problem are moved to a set of holding tables so the rest of the records can be processed.

-- EE with errors removed from most recent batch
with current_batch as (
      select employee_number, PVL.ADDITIONAL_INFORMATION
      from PERSONNEL_VALIDATION_LOG_X PVL
      where PVL.PERSONNEL_BATCH_ID = EMPSRV.CURRENTPERSONNELBATCH(6,900)
)
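-- rank each (employee, message) pair's occurrences oldest-first: rn = 1 is its first appearance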
, hist as (
  select 
    row_number() over (
      partition by X.EMPLOYEE_NUMBER, X.ADDITIONAL_INFORMATION
      order by B.BATCH_STATUS_DATE
    ) as RN,
    B.PERSONNEL_BATCH_ID BatchId,
    B.SUBMITTAL_DATE,
    X.EMPLOYEE_NUMBER EMPNUM,
    MX.LAST_NAME,
    MX.FIRST_NAME,
    X.ADDITIONAL_INFORMATION
  from PERSONNEL_VALIDATION_LOG_X X
  join current_batch C on
    X.Employee_number = C.EMPLOYEE_NUMBER
    and X.additional_information = C.ADDITIONAL_INFORMATION
  join empsrv.personnel_batch B 
    on B.PERSONNEL_BATCH_ID = X.PERSONNEL_BATCH_ID
  join EMPSRV.PERSONNEL_MEMBER_DATA_X MX
    on X.PERSONNEL_BATCH_ID = MX.PERSONNEL_BATCH_ID
      and X.EMPLOYEE_NUMBER = MX.EMPLOYEE_NUMBER
)
select 
  batchId, 
  to_char(submittal_date, 'mm/dd/yyyy') First_Reported,
  EmpNum, 
  Last_name, 
  first_name, 
  additional_information
from hist where rn = 1
order by submittal_date desc;

The first CTE just limits the population to current errors. The hist CTE goes through the logs and picks up the first occurrence of that error (i.e. same EE and message). This isn't perfect: maybe the error went away and came back, in which case I would get the oldest occurrence and not the start of the most recent sequence. But this is good enough, and that case is unlikely given the shape of the error message itself. The final query just picks off the top row of each group, which is the first occurrence.

The query takes a few seconds to run, but my logs are not especially large, so performance is almost never an issue for me. I also don't pay much attention to the dates on the questions.

Darrel Lee