
I have a data aggregation and loading application that loads data files into an Oracle database. We are using Spring Integration for file polling and Spring Batch for loading the data into the database. When more than one file is being processed (read and loaded into the database), the primary key tends to skip some values. The Oracle sequence is created as:

CREATE SEQUENCE "SCHEMA"."SEQUENCE_TABLE_NAME" MINVALUE 1
    MAXVALUE 9999999999999999999999999999 INCREMENT BY 1
    START WITH 241488161 CACHE 20 NOORDER NOCYCLE;

An inbound-channel-adapter has a poller with a task executor. The inbound channel adapter sends the file to a transformer that creates a JobLaunchRequest object, which is launched by the job-launching-gateway. The job has a reader and a JdbcBatchItemWriter that executes the following statement:

<bean id="itemWriter" class="org.springframework.batch.item.database.JdbcBatchItemWriter">
    <property name="dataSource" ref="dataSource"/>
    <property name="sql">
        <value>
            <![CDATA[
            insert into data_table (id, dataA, dataB)
            values (SEQUENCE_TABLE_NAME.nextval, :dataA, :dataB)
            ]]>
        </value>
    </property>
    <property name="itemSqlParameterSourceProvider">
        <bean
             class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider"/>
    </property>
</bean>

The key reason for loading into the database is that an API is exposed which depends on the primary key being ordered and sequential. But once too many files are present in the folder, with multiple threads reading and writing to the database, the primary key provided by the sequence skips some values in between.
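As a rough illustration of how we see the problem, the gap can be observed by comparing the sequence's high-water mark with the highest key actually loaded (only a diagnostic sketch; LAST_NUMBER also counts values still sitting in the cache, so it is not an exact gap count):

-- Diagnostic only: a growing difference between the sequence and MAX(id)
-- indicates skipped values.
SELECT s.last_number,
       s.cache_size,
       (SELECT MAX(id) FROM data_table) AS max_loaded_id
  FROM user_sequences s
 WHERE s.sequence_name = 'SEQUENCE_TABLE_NAME';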

A similar case is explained here (from an Oracle perspective alone).

Oscar Makala
  • I think your problem lies with `CACHE 20`, try reducing it to 1... – Usagi Miyamoto Jul 05 '17 at 07:00
  • @UsagiMiyamoto We have tried setting the commit interval similar to the cache size, then tried NOCACHE to not use the cache at all, but with the same effect. Do note that we have files with more than 30000 records being uploaded every 2-5 minutes; disabling the cache means that for every insert a read is made on disk. – Oscar Makala Jul 05 '17 at 07:03
  • @ScaryWombat My code should not care, but as I stated, the REST API exposing the data to a third party has a requirement for a sequential primary key; if values are missing, an audit issue is created. – Oscar Makala Jul 05 '17 at 07:05
  • @OscarMakala Thanks for the clarification – Scary Wombat Jul 05 '17 at 07:07
  • Missing values can be created by errors, too: `ROLLBACK` does not roll back sequences. Thus you should create your own implementation of a sequence that runs in transactions... – Usagi Miyamoto Jul 05 '17 at 07:09
  • @UsagiMiyamoto Thank you. We have taken all of that into consideration. Could you share an approach for "create your own implementation of a sequence that runs in transactions"? – Oscar Makala Jul 05 '17 at 07:12

3 Answers


you should create your own implementation of a sequence that runs in transactions...

Try something like this:

CREATE TABLE schema.sequence_table ( seq NUMBER );

INSERT INTO schema.sequence_table VALUES (0);

CREATE OR REPLACE FUNCTION get_next_seq
  RETURN NUMBER
AS
  s NUMBER;
BEGIN
  -- Lock the single counter row so concurrent transactions serialize here
  -- and the increment is rolled back together with the inserting transaction.
  SELECT COALESCE(seq, 0) + 1
    INTO s
    FROM schema.sequence_table
    FOR UPDATE;

  UPDATE schema.sequence_table
    SET seq = s;

  RETURN s;
END;
/
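
A minimal usage sketch, assuming the function is called from PL/SQL rather than directly inside the writer's SQL (the bind names simply mirror the writer's parameters; how to wire this into the JdbcBatchItemWriter statement is left open):

-- Sketch: fetch the next transactional value, then insert with it.
DECLARE
  new_id NUMBER;
BEGIN
  new_id := get_next_seq;

  INSERT INTO data_table (id, dataA, dataB)
  VALUES (new_id, :dataA, :dataB);
END;
/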
Usagi Miyamoto
  • One should ensure that this won't cause too many waits in Oracle. Otherwise all transactions involving this might be executed sequentially. And since the transaction spans the whole processing of the entry, this can be a bottleneck. – SpaceTrucker Jul 05 '17 at 07:28

IMHO, if you have a requirement that no gap may exist in a column, then you should not use the primary key's sequence for that, because the primary key must exist at INSERT time, and if the transaction is later aborted you will get a gap.

A simple solution would be to have a dedicated column, distinct from the primary key, that is updated after a successful insertion by a single thread (or a dedicated single-threaded application). This task would fetch recently inserted rows, for example rows whose primary key is above the current sequence value minus a threshold (to speed up the query when the table contains a great number of rows) and whose dedicated column is still null, assign them consecutive values, commit, and iterate. Once per day, at the moment of lowest load, the task should fetch all rows that could still have a null value in the dedicated column (a non-indexed query) as a catch-all pass.
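
A rough sketch of what one iteration of that backfill could look like (the api_seq column name and the 100000 threshold are only illustrative):

-- Single-threaded backfill pass: assign consecutive api_seq values to
-- recently inserted rows that do not have one yet.
DECLARE
  v_next   NUMBER;  -- last gap-free value assigned so far
  v_max_id NUMBER;  -- current high-water mark of the primary key
BEGIN
  SELECT COALESCE(MAX(api_seq), 0) INTO v_next   FROM data_table;
  SELECT COALESCE(MAX(id), 0)      INTO v_max_id FROM data_table;

  FOR r IN (SELECT id
              FROM data_table
             WHERE api_seq IS NULL
               AND id > v_max_id - 100000   -- threshold to keep the scan cheap
             ORDER BY id)
  LOOP
    v_next := v_next + 1;
    UPDATE data_table
       SET api_seq = v_next
     WHERE id = r.id;
  END LOOP;

  COMMIT;
END;
/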

Serge Ballesta

First of all, exposing a primary key via an API is generally bad practice.

But here is another approach to it:

Let all threads create those rows in the original table and don't care about the primary key at all. Create a new table that has a foreign key reference to the original table and a second column for storing the sequential value. A single thread will continuously pick rows from the original table and create the corresponding rows in the new table using a sequence. Then expose the sequential value of the new table as the key in your API.
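
A sketch of what that could look like (table, column and sequence names are only illustrative):

-- Mapping table: api_key is the gap-free value the API exposes.
CREATE TABLE data_table_api_key (
  data_id NUMBER NOT NULL REFERENCES data_table (id),
  api_key NUMBER NOT NULL
);

CREATE SEQUENCE api_key_seq START WITH 1 INCREMENT BY 1 NOCACHE;

-- Run periodically by the single thread: pick up rows not yet mapped
-- and hand out consecutive api_key values in primary-key order.
BEGIN
  FOR r IN (SELECT d.id
              FROM data_table d
             WHERE NOT EXISTS (SELECT 1
                                 FROM data_table_api_key k
                                WHERE k.data_id = d.id)
             ORDER BY d.id)
  LOOP
    INSERT INTO data_table_api_key (data_id, api_key)
    VALUES (r.id, api_key_seq.NEXTVAL);
  END LOOP;
  COMMIT;
END;
/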

Since only that thread is accessing the sequence, there should be no gaps. Unless you are expecting the highest loads imaginable, that single thread should be able to keep up with the load generated by multiple other threads.

There will be some latency before an item becomes visible to your API, because there is now a second transaction involved.

SpaceTrucker
  • @OscarMakala Would your client really care about the few millis between writing the original row and writing the row in the new table? If they really do, then make the original request wait until the thread writing the new table has picked up that thread's entry. – SpaceTrucker Jul 05 '17 at 07:23
  • Some streams have up to 100 million records loaded in a day. Please note there is more than one stream, and creating an additional table also has an impact on the amount of storage. And with one thread working on this amount of data, it might not be able to finish processing within the allowed window. – Oscar Makala Jul 05 '17 at 07:24
  • @OscarMakala But even 100 million records per day are only 1000 records per second on average. The single thread writing the new table can use jdbc batching very well. – SpaceTrucker Jul 05 '17 at 07:29