4

I have a table, Foo. I run a query on Foo to get the IDs of a subset of its rows. I then want to run a more complicated set of queries, but only on those IDs. Is there an efficient way to do this? The best I can think of is constructing a query such as:

SELECT ... --complicated stuff
WHERE ... --more stuff
  AND id IN (1, 2, 3, 9, 413, 4324, ..., 939393)

That is, I construct a huge "IN" clause. Is this efficient? Is there a more efficient way of doing this, or is the only way to JOIN with the initial query that gets the IDs? If it helps, I'm using SQLObject to connect to a PostgreSQL database, and I have access to the cursor that executed the query to get all the IDs.

UPDATE: I should mention that the more complicated queries all either rely on these IDs, or create more IDs to look up in the other queries. If I were to make one large query, I'd end up joining six tables at once or so, which might be too slow.
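For reference, the big-IN approach can at least be built safely with query placeholders rather than string interpolation. A minimal sketch, using Python's stdlib sqlite3 as a stand-in for the PostgreSQL/SQLObject setup (the `foo` schema and column names here are invented for illustration):

```python
import sqlite3

# Hypothetical schema; sqlite3 stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, category TEXT, value INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?, ?, ?)",
                 [(1, "a", 10), (2, "a", 20), (3, "b", 30), (4, "a", 40)])

# First query: collect the subset of IDs.
ids = [row[0] for row in conn.execute("SELECT id FROM foo WHERE category = 'a'")]

# Build the IN clause from placeholders, passing the IDs as parameters.
placeholders = ", ".join("?" for _ in ids)
query = f"SELECT id, value FROM foo WHERE value > ? AND id IN ({placeholders})"
rows = conn.execute(query, [15] + ids).fetchall()
```

With very long ID lists this still produces a very long statement, which is part of what the answers below try to avoid.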

Claudiu

4 Answers

6

One technique I've used in the past is to put the IDs into a temp table, and then use that to drive a sequence of queries. Something like:

BEGIN;
CREATE TEMP TABLE search_result ON COMMIT DROP AS
  SELECT entity_id
  FROM entity /* long complicated search joins and conditions ... */;
-- Fetch primary entities
SELECT entity_id, entity.x /*, ... */
FROM entity JOIN search_result USING (entity_id);
-- Fetch some related entities
SELECT entity_id, related_entity_id, related_entity.x /*, ... */
FROM related_entity JOIN search_result USING (entity_id);
-- And more, as required
END;

This is particularly useful where the search result entities have multiple one-to-many relationships which you want to fetch without either a) doing N*M+1 selects or b) doing a cartesian join of related entities.

araqnid
  • is there any difference to doing this vs. creating a temporary view? – Claudiu Jun 08 '10 at 23:00
  • will a view keep running the query, but a temp table just stores the results? seems like what i want is the latter – Claudiu Jun 08 '10 at 23:01
  • 2
    By creating a temporary table, the actual results themselves are spooled, so the potentially-complicated search is only executed once. With a temporary view, the search conditions would be transferred into each subsequent statement. If your search conditions are simple, a view may be more efficient. If they are significantly complex, materialising the results into a temp table will win. I'd naturally start by using a temp table, and maybe have the app driving this use a temp view if it determines that the search conditions are simple enough by some rule-of-thumb. – araqnid Jun 08 '10 at 23:03
  • i'm a noob, but if i have a set of selects between a begin and end, like in your example, how do i actually get the results? I tried just modifying a query to be `BEGIN; query; END;`, and I got an error: `no results to fetch` – Claudiu Jun 08 '10 at 23:14
  • you can answer here, i asked a separate question: http://stackoverflow.com/questions/3002033/sql-get-data-out-of-begin-end-block-in-python – Claudiu Jun 08 '10 at 23:17
  • "BEGIN" and "END" are SQL statements themselves to start/finish a transaction. So you can skip them if your driver handles transactions already. (What I wrote is intended to work fine with psql). Note I used "ON COMMIT DROP" to have the temp table automatically get cleaned up at the transaction end: if your driver is not using transactions, you'll need to take that out to avoid getting errors about search_result not being found (you should also add a command to drop search_result manually, or be clever and truncate and reuse it). – araqnid Jun 08 '10 at 23:19
  • Crossed in the post... sorry for the confusion! Sorry I'm not too au fait with Python (in general) I've tried to answer as best I can. – araqnid Jun 08 '10 at 23:24
1

I would think it might be useful to use a VIEW. Simply create a view with your query for IDs, then join to that view via ID. That will limit your results to the required subset of IDs without an expensive IN clause.

I do know that the IN statement is more expensive than an EXISTS statement would be.

Nick Vallely
0

I think the join with the criteria to select the IDs will be more efficient, because the query optimizer has more options to do the right thing. Use EXPLAIN to see how PostgreSQL will approach it.
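To illustrate inspecting a plan: PostgreSQL uses `EXPLAIN` / `EXPLAIN ANALYZE`; the runnable sketch below uses SQLite's equivalent, `EXPLAIN QUERY PLAN`, via Python's stdlib sqlite3 (schema and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, category TEXT)")
conn.execute("CREATE INDEX foo_cat ON foo (category)")

# Ask the planner how it would execute the query; here it should
# report a search using the foo_cat index rather than a full scan.
plan = conn.execute("""EXPLAIN QUERY PLAN
                       SELECT id FROM foo WHERE category = 'a'""").fetchall()
for row in plan:
    print(row)
```

In PostgreSQL the equivalent check is `EXPLAIN ANALYZE SELECT ...`, which also shows actual row counts and timings, and often reveals missing or unused indexes.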

Peter Tillemans
  • ill try the explain plan. i updated to mention that i'd end up joining 6 tables or so, which seems like it'd take too long. – Claudiu Jun 08 '10 at 22:42
  • You never know, it is almost impossible to predict. The optimizer is smart and uses statistics from your tables. Make sure your database is vacuum analyzed. Looking at the explain plan often gives insights like missing indexes, or indexes which are there and not being used. – Peter Tillemans Jun 08 '10 at 22:53
0

You are almost certainly better off with a join; however, another option is to use a subselect, e.g.:

SELECT ... --complicated stuff
WHERE ... --more stuff
  AND id IN (select distinct id from Foo where ...)
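The subselect form keeps everything in one statement, so the ID list never round-trips through the application. A runnable sketch with Python's stdlib sqlite3 (the `foo` schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, category TEXT, value INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?, ?, ?)",
                 [(1, 'a', 5), (2, 'b', 50), (3, 'a', 60)])

# The subselect supplies the IDs inline; no IN-list is built in Python.
sub = conn.execute("""SELECT id, value FROM foo
                      WHERE value > 10
                        AND id IN (SELECT DISTINCT id FROM foo WHERE category = 'a')""").fetchall()
```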
mhawke