1

I am using Postgres with libpqxx, and I have a table that we will simplify down to

data_table
{
  bytea id PRIMARY KEY,
  bigint size
}

If I have a set of IDs in C++, e.g. std::unordered_set&lt;ObjectId&gt; Ids, what is the best way to get the id and size columns out of data_table?

I have so far used a prepared statement: constexpr const char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";

Then in a transaction I have called that prepared statement for every entry in the set, and retrieved the result for every entry in the set:

pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint64_t> result;
for (const auto& id : Ids)
{
  auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
  result.emplace(id, transactionResult[0][0].as<uint64_t>());  // size is bigint, so use a 64-bit type
}
return result;

Because the set can contain tens of thousands of objects and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of Postgres.

I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if there is a much more efficient way.

Edit: For what it's worth, the ObjectId class is basically a type wrapper over std::array&lt;uint8_t, 32&gt;, i.e. a 256-bit cryptographic hash.

Alex

2 Answers

1

The task as I understand it:

Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").

The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query by "including" the size column in the PK index like this (requires Postgres 11 or later):

CREATE TABLE data_table (
  id   bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size)  -- !
);

About covering indexes:

Then retrieve all rows in a single query (or few queries) for many IDs at once like:

SELECT id, size
FROM   data_table
JOIN  (
    VALUES ('id1'), ('id2') -- many more 
    ) t(id) USING (id);
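From C++ that query can be built with one numbered placeholder per ID, so the binary values are bound as parameters instead of being spliced into the SQL text. This is a sketch, not the answer's own code: BuildBatchQuery is a hypothetical helper, and the commented usage assumes the question's ToPqxxBinaryString and a reasonably recent libpqxx (exact parameter-passing API varies by version).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: build
//   SELECT id, size FROM data_table
//   JOIN (VALUES ($1::bytea),($2::bytea),...) t(id) USING (id)
// with one numbered parameter per ID. The ::bytea cast tells Postgres
// the type of the VALUES column.
std::string BuildBatchQuery(std::size_t idCount)
{
    std::string sql = "SELECT id, size FROM data_table JOIN (VALUES ";
    for (std::size_t i = 0; i < idCount; ++i)
    {
        if (i > 0) sql += ',';
        sql += "($" + std::to_string(i + 1) + "::bytea)";
    }
    sql += ") t(id) USING (id)";
    return sql;
}

// Sketch of usage with libpqxx (names from the question; API by version):
//   pqxx::params p;
//   for (const auto& id : Ids) p.append(ToPqxxBinaryString(id));
//   auto rows = transaction.exec_params(BuildBatchQuery(Ids.size()), p);
//   for (const auto& row : rows) { /* row["id"], row["size"] */ }
```

If the set is very large, send it in chunks (a few thousand IDs per statement) rather than one giant parameter list.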

Or one of the other methods laid out here:

Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table. See:

And you do not need an index on the temporary table, as that one will be read in a sequential scan anyway. You only need the covering PK index outlined above.

You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
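The temp-table route described above can be sketched in SQL like this (table and file names are placeholders; \copy is the client-side variant of COPY in psql, and a driver would use COPY ... FROM STDIN instead):

```sql
-- No index needed on the temp table; it is scanned sequentially anyway.
CREATE TEMP TABLE tmp_ids (id bytea);

-- Bulk-load the IDs in one round trip instead of row-by-row INSERTs.
\copy tmp_ids FROM 'ids.txt'

ANALYZE tmp_ids;  -- optional: give the planner column statistics

SELECT d.id, d.size
FROM   tmp_ids t
JOIN   data_table d USING (id);
```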

Erwin Brandstetter
0

The id is a primary key and so is indexed, so my first concern would be query setup time; a stored procedure is precompiled, for instance. A second tack is to put your set in a temp table, possibly also keyed on the id, so the two tables/indexes can be joined in one SELECT. The indexes for this should be ordered (B-tree rather than hash), so they can be merged.

  • I did consider a temp table, but I was not sure if it would be faster to insert all those IDs one by one into it then do a join. – Alex Jul 01 '20 at 02:19
  • You can insert many at a time using INSERT ... SELECT ... UNION SELECT ..., say 128 at a time, for 99+% savings. Many RDBMSs have bulk-import tools to load tables, which you might popen() and write to. I worked with an Oracle Java app that sent 4K rows at a time in an object that could be cast to a table for insert (into many tables, using the Oracle multi-table insert). In Interbase, I wrote some UDFs and a stored procedure that together allowed me to cast a file as a table: the UDFs opened the file, read lines to EOF, and returned fields, while the stored procedure generated the rows. – David G. Pickett Jul 02 '20 at 14:54
  • You can also query 128 at a time, for instance in a join to a derived table of (SELECT constant UNION SELECT constant ...). This cuts your setup time to <1%. Some RDBMSs have facilities to cast a flat file as a table. In a Google world, you have to imagine it to google for it to find it! So many ways to . . . . – David G. Pickett Jul 26 '20 at 18:21