Most efficient way to select in bulk from a multi million records table

Question

I'm interested in getting and doing some processing on all the entities A returned by a query of the form:

 SELECT * FROM A a WHERE a.id not in (select b.id from B)

Where A is a "complex" entity in the sense that it inherits (InheritanceTyped.Joined) from other entities and that several of its attributes are other entities (@OneToOne and @ManyToOne).

The query itself takes a few minutes to yield results hence my desire to execute it as few as possible.

Here are the different approaches i tried to get those A elements as efficiently as possible :

Pagination using setFirstResult/ setMaxResults Do the job, but pretty slowly as the query seems to be executed everytime.(around 50 elements processed/sec)
Getting IDs first, A objects next Keeping all the IDs in memory is doable, so I execute once
```
SELECT a.id FROM A a WHERE a.id not in (select b.id from B)
```

and then select a from A a WHERE a.id= :id, which goes relatively fast as the id column is indexed. This is currently the solution that is the most efficient with (around 100 elements processed/sec)

Using ScollableResults I had high hope with this solution, but it ended up being slower than other alternatives, leaving me at around 20 elements processed/sec ...

As a neophyte, I don't know what other options to investigate, or if I did something wrong in any of my attempts.

Hence my questions:

Are there (factually) other approaches to efficiently tackle this kind of problem ?
Is it normal that ScrollableResults performed so poorly ? Is there something I should have paid attention to while implementing this solution?

EDIT: Here's the execution plan

Your current query won't even run on Oracle, so maybe you should fix your question first. Saying `WHERE a.id not in (select * from B)` is nonsensical, because `SELECT *` in general would return more than one column (unless table `B` coincidentally happen to have just one column). — Tim Biegeleisen, Feb 19 '19 at 07:48
*"getting and doing some processing on all the entities A"* - what sort of processing? — APC, Feb 19 '19 at 08:14
It is hard to say if your `A` and `B` are joins of several tables each. But the simple cursor processing should be the efficientest one, because you get rid of the `ORM` overhead and the *N +1* problem. Some care must be taken though, you problem with cursor can be as simple as that the DB is optimized to get `FIRST ROW` so you do not wait on the first row but the throughput is low. Without seeing the real SQL and the [execution plan](https://stackoverflow.com/a/34975420/4808122) it is hard to say anything. — Marmite Bomber, Feb 19 '19 at 08:29
@MarmiteBomber: That was my thoughts as well from what I read online... I updated the question with the execution plan — Loïc Gammaitoni, Feb 19 '19 at 08:57
The most efficient way of doing calculations and insertions into another table is to keep the data in the database. Nothing will perform better than a pure SQL `insert into ... select ... from` set operation. The bigger the tables the greater the impact of not doing it in the database. — APC, Feb 19 '19 at 11:39
This won't do as what I insert into the DB is all processed data, and the processing being quite complex, I can't do it in SQL. I don't have too much problem with the time the insertion take. It's the time taken by the select that gives me trouble — Loïc Gammaitoni, Feb 19 '19 at 11:54
Try converting your query to hibernate named query. Did you check the performance with Not Exists? — Atul, Feb 19 '19 at 16:45

Most efficient way to select in bulk from a multi million records table

0 Answers0