
We are supposed to receive our report hourly (covering a 3h range) as well as daily (covering a 24h range). The report contains Entry records keyed by (reference, id1, id2) that must be matched against Transaction records in our DB keyed by the same (reference, id1, id2). Whenever a match is found, we include that record in a final comparison between the amount stated in the report and the sum of the amounts of the matching records in our DB.
We are using JEE7, and so far we have come up with this approach:
1. Parse/deserialize the XML with Jackson, streaming each Entry into a Java object (a sketch follows this list).
2. Once an in-memory threshold of 10K objects is reached, sum/filter these objects, then serialize them to JSON before storing them temporarily in the DB.
3. By the end of the parsing we thus have our Entry data filtered, summed, and serialized to JSON in the DB. From this point on we compare these Entry objects with our Transaction records in parallel.
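
A minimal sketch of step 1, assuming jackson-dataformat-xml is used for the XML binding; the Entry element name, its fields, and the flushBatch hook are assumptions based on the description above:

import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class ReportParser {

    // Assumed shape of an <Entry> element in the report
    public static class Entry {
        public String reference;
        public String id1;
        public String id2;
        public BigDecimal amount;
    }

    private static final int THRESHOLD = 10_000;

    public void parse(InputStream in) throws Exception {
        XmlMapper mapper = new XmlMapper();
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
        List<Entry> batch = new ArrayList<>(THRESHOLD);
        while (reader.hasNext()) {
            // Bind one <Entry> at a time so the full report never sits in memory
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "Entry".equals(reader.getLocalName())) {
                batch.add(mapper.readValue(reader, Entry.class));
                if (batch.size() >= THRESHOLD) {
                    flushBatch(batch);   // step 2: sum/filter and store temporarily
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {
            flushBatch(batch);
        }
        reader.close();
    }

    private void flushBatch(List<Entry> batch) {
        // hypothetical hook for step 2
    }
}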

Here I want some optimization, because all we want is to compare Entry(reference, id1, id2) = Transaction(reference, id1, id2); if a match is found, we include that Transaction in our final calculation because it was part of our report. So essentially every (reference, id1, id2) tuple should be unique.
I am reading about the ora_hash function and multi-column indexes to speed up this lookup. There is also the Oracle limitation on the SQL IN clause (at most 1,000 expressions per list); a workaround is sketched below.
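
If you do stay with IN lists, a common workaround is to chunk the key tuples and run one query per chunk. A hedged JDBC sketch; the table and column names follow the question, and EntryKey is a hypothetical holder for the triple:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class InClauseChunker {

    private static final int CHUNK = 1000;   // stay within the per-list limit

    // Hypothetical holder for (reference, id1, id2)
    public static class EntryKey {
        public final String reference, id1, id2;
        public EntryKey(String reference, String id1, String id2) {
            this.reference = reference;
            this.id1 = id1;
            this.id2 = id2;
        }
    }

    public BigDecimal sumMatching(Connection con, List<EntryKey> keys) throws SQLException {
        BigDecimal total = BigDecimal.ZERO;
        for (int from = 0; from < keys.size(); from += CHUNK) {
            List<EntryKey> chunk = keys.subList(from, Math.min(from + CHUNK, keys.size()));
            // Build a tuple IN list of the right size for this chunk
            StringBuilder sql = new StringBuilder(
                    "SELECT SUM(amount) FROM Transaction WHERE (reference, id1, id2) IN (");
            for (int i = 0; i < chunk.size(); i++) {
                sql.append(i == 0 ? "(?,?,?)" : ",(?,?,?)");
            }
            sql.append(")");
            try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
                int p = 1;
                for (EntryKey k : chunk) {
                    ps.setString(p++, k.reference);
                    ps.setString(p++, k.id1);
                    ps.setString(p++, k.id2);
                }
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next() && rs.getBigDecimal(1) != null) {
                        total = total.add(rs.getBigDecimal(1));
                    }
                }
            }
        }
        return total;
    }
}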
We are also using an EJB @Asynchronous method to achieve this parallelism, and we still need to work out how to know when the asynchronous call ends without introducing a Thread delay (one option is sketched below).
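
Regarding knowing when the asynchronous call ends: a JEE7 @Asynchronous method can return a Future via AsyncResult, and the caller can collect those Futures and block on get() instead of sleeping. A minimal sketch; the bean names and the batch shape are assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import javax.ejb.AsyncResult;
import javax.ejb.Asynchronous;
import javax.ejb.EJB;
import javax.ejb.Stateless;

// MatcherBean.java
@Stateless
public class MatcherBean {

    // Runs on the container's async thread pool; the container wraps the
    // AsyncResult into the Future the caller receives
    @Asynchronous
    public Future<Integer> matchBatch(List<String[]> keys) {   // each row: {reference, id1, id2}
        int matched = 0;
        // ... compare this batch of Entry keys against Transaction records ...
        return new AsyncResult<>(matched);
    }
}

// ReconciliationService.java
@Stateless
public class ReconciliationService {

    @EJB
    private MatcherBean matcher;   // must be invoked through the container proxy

    public int matchAll(List<List<String[]>> batches) throws Exception {
        List<Future<Integer>> pending = new ArrayList<>();
        for (List<String[]> batch : batches) {
            pending.add(matcher.matchBatch(batch));   // returns immediately
        }
        int total = 0;
        for (Future<Integer> f : pending) {
            total += f.get();   // blocks until that batch is done; no Thread.sleep needed
        }
        return total;
    }
}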

I want to know a few optimizations for this approach, and in general whether the approach is OK. After we have marked all our Transaction records on the RHS, step 4 is to query and sum all these amounts and match the total against the one that came with the report; this concludes our process.

Hooman.AS

2 Answers


You are going through way too many intermediate steps here. If you are blocking on I/O in the database, you can use a pub/sub pattern (via the EJB async method) to keep the database maxed out, but I'd test with a single thread first. You will likely saturate your I/O and be blocked anyway.

If you have already deserialized the data (from XML), why don't you just write the reference, id1, and id2 to a local table, then join it with your Transaction table for your report? Why go through all these intermediate steps (JSON?)? Don't try to beat the database at large-scale data comparison; it's going to be better at it than you are.

Honestly, depending on your RDBMS, you might be solving this totally the wrong way. Most RDBMS products have the ability to parse an XML file directly:
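
For example, on Oracle you can let the database shred the XML itself with XMLTABLE. A hedged sketch via JDBC, assuming a REPORT_DIR directory object pointing at the file and the element/field names from the question:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class XmlInDatabase {

    // Assumption: CREATE DIRECTORY report_dir AS '/data/reports'; and the
    // report looks like <Report><Entry>...</Entry></Report>
    public void readEntries(Connection con) throws Exception {
        String sql =
                "SELECT x.reference, x.id1, x.id2, x.amount "
              + "FROM XMLTABLE('/Report/Entry' "
              + "       PASSING XMLTYPE(BFILENAME('REPORT_DIR', 'report.xml'), "
              + "                       NLS_CHARSET_ID('AL32UTF8')) "
              + "       COLUMNS reference VARCHAR2(64) PATH 'reference', "
              + "               id1       VARCHAR2(64) PATH 'id1', "
              + "               id2       VARCHAR2(64) PATH 'id2', "
              + "               amount    NUMBER       PATH 'amount') x";
        try (PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // e.g. insert into a staging table or aggregate directly
            }
        }
    }
}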

Rob Conklin

The typical approach in such a situation is:

  1. Load the data from the file into a staging table

  2. Run the report on the database (including the staging table)

Data loading

The staging table should have the three columns reference, id1 and id2 and the appropriate indices to efficiently join with the Transaction table.

You can load the XML data efficiently using Oracle SQL*Loader. But since you already have a lot of Java code for this, you can also load it using Java and JDBC. For maximum performance, use JDBC batch processing (a sketch follows).
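
A hedged sketch of such a batch load, assuming the staging table described above (the DDL in the comment and all names are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class StagingLoader {

    // Assumed staging table and index, e.g.:
    //   CREATE TABLE Staging (reference VARCHAR2(64), id1 VARCHAR2(64), id2 VARCHAR2(64));
    //   CREATE INDEX staging_idx ON Staging (reference, id1, id2);

    private static final int BATCH_SIZE = 1000;

    public void load(Connection con, List<String[]> rows) throws Exception {
        String sql = "INSERT INTO Staging (reference, id1, id2) VALUES (?, ?, ?)";
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int inBatch = 0;
            for (String[] row : rows) {   // each row: {reference, id1, id2}
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
                if (++inBatch % BATCH_SIZE == 0) {
                    ps.executeBatch();    // one round trip for the whole batch
                }
            }
            ps.executeBatch();            // flush the remainder
        }
        con.commit();
    }
}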

Run report

When you run the report, join against the staging table, e.g.:

SELECT SUM(t.amount)
FROM Transaction t
JOIN Staging s ON t.reference = s.reference
    AND t.id1 = s.id1 AND t.id2 = s.id2;

There's most likely no need for parallelism in this setup. It simply takes advantage of the fact that relational database systems are efficient at processing big chunks of data (as opposed to record-by-record processing).

Codo