My setup (simplified for clarity) is the following:
<int:inbound-channel-adapter channel="in" expression="0">
    <int:poller cron="0 0 * * * *"/>
    <int:header name="snapshot_date" expression="new java.util.Date()"/>
    <int:header name="correlationId" expression="T(java.util.UUID).randomUUID()"/>
    <!-- more here -->
</int:inbound-channel-adapter>
<int:recipient-list-router input-channel="in" apply-sequence="true">
    <int:recipient channel="data.source.1"/>
    <int:recipient channel="data.source.2"/>
    <!-- more here -->
</int:recipient-list-router>
<int:chain input-channel="data.source.1" output-channel="save">
<int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
<int-jdbc:query>
select * from large_dataset
</int-jdbc:query>
</int-jdbc:outbound-gateway>
<int:header-enricher>
<int:header name="source" value="data.source.1"/>
</int:header-enricher>
</int:chain>
<int:chain input-channel="data.source.2" output-channel="save">
<int-jdbc:outbound-gateway data-source="db1" max-rows-per-poll="0">
<int-jdbc:query>
select * from another_large_dataset
</int-jdbc:query>
</int-jdbc:outbound-gateway>
<int:header-enricher>
<int:header name="source" value="data.source.2"/>
</int:header-enricher>
</int:chain>
<int:chain input-channel="save" output-channel="process">
<int:splitter expression="T(com.google.common.collect.Lists).partition(payload, 1000)"/>
<int:transformer>
<int-groovy:script location="transform.groovy"/>
</int:transformer>
<int:service-activator expression="@db2.insertData(payload, headers)"/>
<int:aggregator/>
</int:chain>
<int:chain input-channel="process" output-channel="nullChannel">
<int:aggregator/>
<int:service-activator expression="@finalProcessing.doSomething()"/>
</int:chain>
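The db2 bean referenced by the service activator is not shown above. Roughly, it wraps a JdbcTemplate pointed at db2 and does one batch insert per chunk; the sketch below is simplified and makes two assumptions: the transformed chunk is still a list of column maps (the gateway's default row mapping), and the table and column names are just placeholders.

import org.springframework.jdbc.core.JdbcTemplate;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Simplified sketch of the "db2" bean used by @db2.insertData(payload, headers).
// Assumes the chunk is still a list of column maps; table/column names are placeholders.
public class Db2Writer {

    private final JdbcTemplate jdbcTemplate;   // configured against the db2 DataSource

    public Db2Writer(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void insertData(List<Map<String, Object>> chunk, Map<String, Object> headers) {
        // one batch insert per chunk of up to 1000 rows; run metadata comes from the headers
        List<Object[]> batchArgs = chunk.stream()
                .map(row -> new Object[] {
                        row.get("id"),                     // placeholder columns
                        row.get("value"),
                        headers.get("snapshot_date"),
                        headers.get("source")
                })
                .collect(Collectors.toList());
        jdbcTemplate.batchUpdate(
                "insert into target_table (id, value, snapshot_date, source) values (?, ?, ?, ?)",
                batchArgs);
    }
}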
Let me explain the steps a bit:
- The poller is triggered by cron; the message is enriched with some information about this run.
- The message is sent to multiple data-source chains.
- Each chain extracts data from a large dataset (100k+ rows); the result set message is marked with a source header.
- The result set is split into smaller chunks (see the sketch after this list), transformed, and inserted into db2.
- After all data sources have been polled, some complex processing is initiated, using the information about the run.
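To illustrate the splitting step, here is a small standalone sketch of what the splitter expression T(com.google.common.collect.Lists).partition(payload, 1000) does (integers stand in for result-set rows). Note that Guava's partition returns views backed by the original list, so the full result set stays referenced in memory until every chunk has been processed.

import com.google.common.collect.Lists;

import java.util.ArrayList;
import java.util.List;

// Standalone illustration of the splitter expression: partition() slices the list
// into consecutive chunks of up to 1000 elements. The chunks are views of the
// original list, so the whole result set remains in memory while they are processed.
public class PartitionDemo {

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            rows.add(i);                                // stand-in for ~100k result-set rows
        }

        List<List<Integer>> chunks = Lists.partition(rows, 1000);
        System.out.println(chunks.size());              // 100 chunks of 1000 rows each
    }
}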
This configuration does the job so far, but it is not scalable. The main problem is that I have to load the full dataset into memory first and pass it along the pipeline, which can cause memory issues.
My question is: what is the simplest way to have the result set extracted from db1, pushed through the pipeline, and inserted into db2 in small batches?