I propose 2 more use cases for this partial reduction:
1. Parsing SQL and PL/SQL (Oracle procedural) statements
The standard delimiter for SQL statements is the semicolon (`;`). It separates normal SQL statements from each other. But in a PL/SQL statement the semicolon also separates operators inside the statement, not only whole statements from each other.
One way to parse a script file containing both normal SQL and PL/SQL statements is to first split it by semicolons and then, if a particular statement starts with specific keywords (`DECLARE`, `BEGIN`, etc.), join it with the following statements according to the rules of the PL/SQL grammar.
By the way, this cannot be done with the existing `StreamEx` partial reduce operations, since they only test two adjacent elements: you need to know about the previous stream elements, starting from the initial PL/SQL keyword, to determine whether the current element should be included into the partial reduction or the partial reduction should be finished. In this case a mutable partial reduction may be usable, with a collector holding information about the already collected elements and either a `Predicate` testing only the collector itself (to decide whether the partial reduction should be finished) or a `BiPredicate` testing both the collector and the current stream element.
In theory, we're speaking about implementing an LR(0) or LR(1) parser (see https://en.wikipedia.org/wiki/LR_parser) on top of the Stream pipeline ideology. An LR parser can be used to parse the syntax of most programming languages.
A parser is a finite automaton with a stack. In the case of LR(0), the automaton's transition depends on the stack only. In the case of LR(1), it depends on both the stack and the next element from the stream (theoretically there can be LR(2), LR(3), etc. automata peeking 2, 3, etc. next elements to determine the transition, but in practice all programming languages are syntactically LR(1) languages).
To implement such a parser there should be a `Collector` holding the stack of the finite automaton, plus a predicate testing whether the final state of this automaton has been reached (so we can stop the reduction). In the case of LR(0) it should be a `Predicate` testing the `Collector`'s accumulator itself, and in the case of LR(1) it should be a `BiPredicate` testing both the accumulator and the next element from the stream (since the transition depends on both the stack and the next symbol).
So to implement an LR(0) parser we would need something like the following (`T` is the stream element type, `A` is the accumulator holding both the finite automaton stack and the result, `R` is the result of each parse forming the output stream):

```java
<R, A> Stream<R> Stream<T>.parse(
    Collector<T, A, R> automataCollector,
    Predicate<A> isFinalState)
```

(I removed complexity like `? super T` instead of `T` for compactness; the resulting API should contain these.)
To implement an LR(1) parser we would need something like the following:

```java
<R, A> Stream<R> Stream<T>.parse(
    BiPredicate<A, T> isFinalState,
    Collector<T, A, R> automataCollector)
```

NOTE: in this case the `BiPredicate` should test an element before it is consumed by the accumulator. Remember, an LR(1) parser peeks at the next element to determine the transition. So there is a potential exception if an empty accumulator rejects the next element (the `BiPredicate` returns true, signalling that the partial reduction is over, on an empty accumulator just created by the `Supplier` together with the next stream element).
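The LR(1) variant can be sketched the same way (again, `parse` and `splitBefore` are illustrative names, not an existing API). The difference is that the `BiPredicate` peeks at the next element before it is consumed; this sketch sidesteps the empty-accumulator corner case mentioned above by consulting the predicate only once the accumulator holds at least one element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BiPredicate;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class Lr1Parse {
    // LR(1)-style sketch: the BiPredicate tests the peeked element BEFORE it
    // is consumed; when it signals a final state, the current accumulator is
    // finished and the peeked element starts a fresh one.
    static <T, A, R> List<R> parse(Stream<T> input,
                                   BiPredicate<A, T> isFinalState,
                                   Collector<T, A, R> automataCollector) {
        Supplier<A> supplier = automataCollector.supplier();
        BiConsumer<A, T> accumulator = automataCollector.accumulator();
        Function<A, R> finisher = automataCollector.finisher();
        List<R> out = new ArrayList<>();
        A acc = supplier.get();
        boolean empty = true;
        for (T t : (Iterable<T>) input::iterator) {
            // Only peek-test non-empty accumulators, avoiding the corner case
            // where a freshly created accumulator rejects the next element.
            if (!empty && isFinalState.test(acc, t)) {
                out.add(finisher.apply(acc));
                acc = supplier.get();
            }
            accumulator.accept(acc, t);
            empty = false;
        }
        if (!empty) {
            out.add(finisher.apply(acc));
        }
        return out;
    }

    // Toy usage: start a new group BEFORE each of the given keywords; the
    // decision genuinely needs the peeked element, not only the collected ones.
    static List<List<String>> splitBefore(List<String> tokens, Set<String> keywords) {
        Collector<String, List<String>, List<String>> group = Collector.of(
                ArrayList::new, List::add, (a, b) -> { a.addAll(b); return a; });
        return parse(tokens.stream(), (acc, next) -> keywords.contains(next), group);
    }
}
```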
2. Conditional batching based on stream element type
When we're executing SQL statements we want to merge adjacent data-modification (DML) statements into a single batch (see the JDBC API) to improve overall performance. But we don't want to batch queries, so we need conditional batching (instead of unconditional batching like in Java 8 Stream with batch processing).
For this specific case the `StreamEx` partial reduce operations can be used: if both adjacent elements tested by the `BiPredicate` are DML statements, they should be included into the same batch, so we don't need to know the previous history of the batch collection.
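StreamEx covers this case with `groupRuns(BiPredicate)`, which groups adjacent elements while the predicate holds for each pair. The following is a plain-Java sketch of that behavior (the helpers `isDml` and `batch` are mine), so the example runs without the StreamEx dependency:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class DmlBatching {
    // Plain-Java sketch of StreamEx.groupRuns(BiPredicate): adjacent elements
    // stay in the same group while the predicate holds for each pair.
    static <T> List<List<T>> groupRuns(List<T> input, BiPredicate<T, T> sameGroup) {
        List<List<T>> out = new ArrayList<>();
        List<T> run = null;
        for (T t : input) {
            if (run == null || !sameGroup.test(run.get(run.size() - 1), t)) {
                run = new ArrayList<>();
                out.add(run);
            }
            run.add(t);
        }
        return out;
    }

    // Crude DML check, enough for the example.
    static boolean isDml(String sql) {
        String s = sql.trim().toUpperCase();
        return s.startsWith("INSERT") || s.startsWith("UPDATE") || s.startsWith("DELETE");
    }

    // Adjacent DML statements form one batch; every query stands alone.
    static List<List<String>> batch(List<String> statements) {
        return groupRuns(statements, (a, b) -> isDml(a) && isDml(b));
    }
}
```

With StreamEx itself this would simply be `StreamEx.of(statements).groupRuns((a, b) -> isDml(a) && isDml(b))`.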
But we can increase the complexity of the task and say that batches should be limited in size: say, no more than 100 DML statements in a batch. In this case we cannot ignore the previous batch collection history, and using a `BiPredicate` to determine whether the batch collection should be continued or stopped is insufficient.
Though we could add a flatMap after the `StreamEx` partial reduction to split long batches into parts, this would delay the execution of each 100-element batch until all DML statements had been collected into one unlimited batch. Needless to say, this is against the pipeline ideology: we want to minimize buffering to maximize throughput between input and output. Moreover, unlimited batch collection may result in an `OutOfMemoryError` in the case of a very long run of DML statements without any queries in between (say, a million `INSERT`s as a result of a database export), which is intolerable.
So in the case of this complex conditional batch collection with an upper limit we also need something as powerful as the LR(0) parser described in the previous use case.
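A minimal sketch of that stateful grouping (names `groupWithState` and `batchLimited` are illustrative): the continuation decision sees the whole run collected so far, here its last element and its size, which is exactly the accumulator-testing predicate from the LR(0) use case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class LimitedBatching {
    // Stateful grouping: unlike an adjacent-pair BiPredicate, the decision to
    // continue the current run is made against the run collected so far.
    static <T> List<List<T>> groupWithState(List<T> input,
                                            BiPredicate<List<T>, T> continueRun) {
        List<List<T>> out = new ArrayList<>();
        List<T> run = null;
        for (T t : input) {
            if (run == null || !continueRun.test(run, t)) {
                run = new ArrayList<>();
                out.add(run);
            }
            run.add(t);
        }
        return out;
    }

    // Crude DML check, enough for the example.
    static boolean isDml(String sql) {
        String s = sql.trim().toUpperCase();
        return s.startsWith("INSERT") || s.startsWith("UPDATE") || s.startsWith("DELETE");
    }

    // Batches of adjacent DML statements, capped at maxBatch statements each;
    // queries are never batched.
    static List<List<String>> batchLimited(List<String> statements, int maxBatch) {
        return groupWithState(statements, (run, next) ->
                isDml(run.get(run.size() - 1)) && isDml(next) && run.size() < maxBatch);
    }
}
```

Because each capped batch is emitted as soon as it is full, downstream execution can start immediately and memory use stays bounded, which is the pipeline behavior argued for above.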