
I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.

I want to create a table for this queue, without OIDs, with just one column: the ID of the document as an integer. If a document's ID exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done. The extra table is meant to avoid the VACUUM problems, dead tuples, and transaction deadlocks that would emerge if there were just a dirty bit on each document entry in the main document table.

Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would cover many IDs in one transaction. I don't want to use any kind of nested transaction, and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
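The closest workaround I know of is a guarded insert, roughly like this sketch (work_queue is a placeholder name). But it still has to probe the queue table on every insert, which is exactly the overhead I want to avoid:

-- Sketch: emulate INSERT IF NOT EXISTS with a guarded INSERT.
INSERT INTO work_queue (document_id)
SELECT 123
WHERE NOT EXISTS (
    SELECT 1 FROM work_queue WHERE document_id = 123
);
-- Note: two concurrent transactions can both pass the NOT EXISTS
-- check, so without a unique constraint this can still produce
-- duplicates; it only reduces them.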

The process which processes the work queue will delete all processed IDs and thereby take care of duplicates. (BTW: there is another queue for the next step, so regarding race conditions the idea should be clean and have no problem.)

But I also want the documents to be processed in order: documents with smaller IDs shall always be processed first.

Therefore I want an index which aids LIMIT and ORDER BY on the ID column, the only column in the work queue table. Ideally, given that I have only one column, this would be the primary key. But a primary key must not have duplicates, so it seems I can't do that.

Without the index, ORDER BY and LIMIT would be slow.
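To illustrate, the consumer's batch step would look roughly like this (work_queue is a placeholder name, and the ARRAY literal stands for whatever batch was just processed):

-- Fetch the smallest dirty IDs first; the index makes this cheap.
SELECT DISTINCT document_id
FROM work_queue
ORDER BY document_id
LIMIT 100;

-- ... process those documents ...

-- Delete every copy of the processed IDs, which also disposes of duplicates.
DELETE FROM work_queue
WHERE document_id = ANY (ARRAY[1, 2, 3]);  -- the batch just processed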

I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disc (PostgreSQL does that for every additional index) and double the amount of disc operations for that table.

What is the best thing to do? Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?

Or is adding the second index harmless? Would it become a kind of primary index that lives directly in the primary tuple btree?


Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.

I want to have a table in PostgreSQL with these properties:

  • One column with an integer
  • Allow duplicates
  • Efficient ORDER BY+LIMIT on the column
  • INSERTs should not perform any lookup in that table or in any kind of unique index. An INSERT shall just locate the best page in the main file/main btree for this table and insert the row between other rows, ordered by ID.
  • INSERTs will happen in bulk and must not fail, except for disc full, etc.
  • There shall not be additional btree files for this table, so no secondary indexes
  • The rows should occupy not much space, e.g. have no OIDs

I cannot think of a solution that solves all of this.

My only solution would compromise on the last bullet point: add a PRIMARY KEY covering the integer plus a dummy column, like an OID, a timestamp, or a SERIAL.
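A sketch of that compromise (names are placeholders):

-- A bigserial tie-breaker makes the composite primary key unique,
-- so duplicate document_ids never collide, at the cost of eight
-- extra bytes per row. Note that PostgreSQL heaps are not ordered
-- by the primary key anyway; this still creates a separate index file.
CREATE TABLE work_queue (
    document_id integer   NOT NULL,
    seq         bigserial NOT NULL,
    PRIMARY KEY (document_id, seq)
);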

Another solution would use either a hypothetical INSERT IF NOT EXISTS, a nested transaction, or a special INSERT with a WHERE clause (as sketched above). All of these would add a btree lookup on every insert, and they might cause deadlocks.

(Also posted here: https://dba.stackexchange.com/q/45126/7788)

  • `Many parts of the system would insert IDs to process into that table. Therefore duplicates must be possible.` IMHO your thinking is wrong here: if the id is not unique it is not an id. Possibly `{datasource,document_id}` could serve as a candidate key? – wildplasser Jun 23 '13 at 19:50
  • What kind of queue? First-in-first-out? And what do 5 rows having the same ID mean? – Mike Sherrill 'Cat Recall' Jun 23 '13 at 19:56
  • Thanks a lot for your very good questions! I have edited the post to make it clearer. Sorry for that. There is a proper table with columns and a unique ID for each document. This question here is about an additional work queue table. It's not a queue per se, but just an outsourced dirty-table. 5 rows with the same ID mean the same thing as 1 or 10 rows with that same ID: They mean that the document with that ID is dirty. – Christian Jun 23 '13 at 20:18
  • It's not clear at all. Can a document be in the queue several times? (it seems yes.) But then what does that mean? That the document is processed by several other processes? – ypercubeᵀᴹ Jun 23 '13 at 20:22
  • And why do you think there is no `INSERT IF NOT EXISTS`? Not with these exact keywords (the `IF` specifically), but this functionality is easy to write with `INSERT .. SELECT .. WHERE` – ypercubeᵀᴹ Jun 23 '13 at 20:25
  • A document can logically only be in the queue once (when it's dirty). But there can be multiple rows because I want to do very fast bulk inserts into that table to mark it as dirty. These inserts shall not concern themselves with any unique checks and the extra storage for duplicates doesn't matter to me. Sorry for structuring the question so badly! I should probably delete everything about the queue. – Christian Jun 23 '13 at 21:30
  • Do you think that a simple unique index check (without any writing to disk if the check is positive) will be less efficient than writing to disk every time (and then finding and deleting the multiple, scattered appearances of a document ID when it is to be removed)? – ypercubeᵀᴹ Jun 23 '13 at 21:35
  • Yes, I believe that a unique index with all its overhead is less efficient than having duplicates: usually documents will not be marked as dirty by multiple processes at the same time. Also I don't want any deadlocks. Actually I use a column in the main table now which says that it's dirty. I have much trouble with VACUUM and deadlocks. – Christian Jun 23 '13 at 21:39
  • Seems to have been re-posted in rewritten form to http://dba.stackexchange.com/q/45126/7788 – Craig Ringer Jun 24 '13 at 02:34
  • You have not explained the background. The *why* for this *how*. Without that, there's too much guesswork. – Craig Ringer Jun 24 '13 at 02:38

2 Answers


You said

Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. Therefore duplicates must be possible.

and

5 rows with the same ID mean the same thing as 1 or 10 rows with that same ID: They mean that the document with that ID is dirty.

You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.

A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.

Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
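For example, a PL/pgSQL function can swallow the duplicate key error (a sketch; the function name is mine, the table matches the example below):

-- Sketch: insert one ID and ignore a duplicate key error.
CREATE OR REPLACE FUNCTION mark_dirty(p_id integer) RETURNS void AS $$
BEGIN
    INSERT INTO dirty_documents (document_id) VALUES (p_id);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- already marked dirty; nothing to do
END;
$$ LANGUAGE plpgsql;
-- Caveat: the EXCEPTION clause creates a subtransaction internally,
-- which is the kind of nesting the question wanted to avoid.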


An implementation that allows duplicates . . .

create table dirty_documents (
  document_id integer not null
);

create index on dirty_documents (document_id);

Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.

insert into dirty_documents 
select generate_series(1,100000);

insert into dirty_documents
select generate_series(1, 100);

insert into dirty_documents
select generate_series(1, 50);

insert into dirty_documents
select generate_series(88000, 93245);

insert into dirty_documents
select generate_series(83000, 87245);

Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.

Pick the first dirty document ID number for cleaning up.

select min(document_id) 
from dirty_documents; 

document_id
--
1

Took only 0.136 ms. Now let's delete every row that has document ID 1.

delete from dirty_documents
where document_id = 1; 

Took 0.272 ms.

Let's start over.

drop table dirty_documents;
create table dirty_documents (
  document_id integer primary key
);

insert into dirty_documents 
select generate_series(1,100000); 

Took 500 ms. Let's find the first one again.

select min(document_id) 
from dirty_documents; 

Took .054 ms. That's about half the time it took using a table that allowed duplicates.

delete from dirty_documents
where document_id = 1;

Also took .054 ms. That's roughly five times faster than the other table.

Let's start over again, and try an unindexed table.

drop table dirty_documents;
create table dirty_documents (
  document_id integer not null
);

insert into dirty_documents 
select generate_series(1,100000);

insert into dirty_documents
select generate_series(1, 100);

insert into dirty_documents
select generate_series(1, 50);

insert into dirty_documents
select generate_series(88000, 93245);

insert into dirty_documents
select generate_series(83000, 87245);

Get the first document.

select min(document_id) 
from dirty_documents; 

Took 32.5 ms. Delete those documents . . .

delete from dirty_documents
where document_id = 1;

Took 12 ms.

All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
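If you want the plan along with the timing, EXPLAIN ANALYZE shows both, for example:

-- Show the actual plan and run time for the ORDER BY + LIMIT pattern.
EXPLAIN ANALYZE
SELECT document_id
FROM dirty_documents
ORDER BY document_id
LIMIT 1;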

  • I don't want to track which process marked it as dirty. It's only like a flag. How is it simple to trap duplicate key errors? I don't want to start one nested transaction per INSERT. I want the raw performance of inserting rows into the table; I don't want PostgreSQL to check or update any indexes! The question is whether PostgreSQL can somehow not check for duplicates (like PRIMARY KEY does), but use the column anyway as the main on-disc order for storing the data in the main btree (like PRIMARY KEY does). – Christian Jun 23 '13 at 21:27
  • On the one hand, you say, "Therefore I want to have an index..." On the other hand, you say "I don't want PostgreSQL to check or update any indexes". Which one do you want? You can create an index on *any* column: `CREATE INDEX ON table_name (column_name);` But if you do that, PostgreSQL will have to update the index when you insert a row, and it will have to update the index when you delete a row. – Mike Sherrill 'Cat Recall' Jun 23 '13 at 21:32
  • Actually it's possible to have both: PostgreSQL stores all rows in a btree. I want that btree to be ordered by the ID. I believe that the primary KEY does exactly that, but forces a duplicate check on me. So I want the main btree to be ordered so that ORDER BY+LIMIT are fast. – Christian Jun 23 '13 at 21:41
  • I just told you how to do that. *You can create an index on any column: `CREATE INDEX ON table_name (column_name);` But if you do that, PostgreSQL will have to update the index when you insert a row, and it will have to update the index when you delete a row.* An index like this allows duplicates; it doesn't allow you to avoid updating it. – Mike Sherrill 'Cat Recall' Jun 23 '13 at 22:05
  • I also wrote about a secondary index in my original question. But I want to avoid the overhead that comes with that. Is there any solutions which neither needs a second btree on disc, nor additional space for a dummy column to appease the primary index? – Christian Jun 23 '13 at 22:15
  • @Christian: please do stop thinking _sequentially_. (I read your posting history, and you appear to be a coder, not a database person.) All your concerns about INDEX and LIMIT, and ORDER of processing could just as well be futile. Try to view a database table as a _collection of facts_: if there is a record, there is a fact. In your case: if there is a record in the queue/worklist, somebody wants this item to be processed. Please don't optimise prematurely. (How many _facts_ do you want to be processed per second?) – wildplasser Jun 23 '13 at 22:37
  • There is no premature optimization. I have worked with the old database scheme (with the dirty flag) for many years now and know all its ins and outs, so forgive me that I want these exact table properties. This is not a database of facts, it's a database of documents, and the database mainly controls how the documents are processed. And it's HUGE. I really need this database to do exactly these things as fast as possible and without many dead tuples. I'd really rather use a btree directly, but can't change from PostgreSQL for various reasons. – Christian Jun 23 '13 at 22:49
  • `so forgive me that I want these exact table properties. This is not a database of facts, it's a database of documents` Please learn about data modelling: a table is a series of (similar) facts. – wildplasser Jun 23 '13 at 23:34
  • *"There is no premature optimization."* There is if you haven't written and tested any code. That's pretty much the definition. – Mike Sherrill 'Cat Recall' Jun 23 '13 at 23:49
  • What @Christian appears to want is an *index-organised table*. This isn't supported in PostgreSQL at present. I'm not convinced it's actually the right answer to the real underlying problem here, but that's roughly what you're describing. – Craig Ringer Jun 24 '13 at 02:37
  • Running your script with the primary key and then doing a `select * from pg_class;` already shows two relations on disc, one of them the pkey. Therefore what I wanted (an "index-organised table") is simply not available; PostgreSQL seems to not use the main relation's order at all. Given this, your benchmarks seem to be spot on. Thanks a lot, I'll check them in detail later. – Christian Jun 24 '13 at 06:29

Reading between the lines, I think you're trying to implement a work-queueing system.

Stop. Now.

Work queueing is hard. Work queueing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without realising it, or they have nasty bugs in concurrent operation.

Use an existing message/task queueing system. ZeroMQ, RabbitMQ, PGQ, etc etc etc etc. There are lots to choose from and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.

The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.

  • There will be only one consumer of this queue, so on the receiving side there shouldn't be too many concurrency issues. Also, the receiving side does not receive real "jobs" with data to process; it only processes dirty documents. I appreciate it very much that you get what I want (an index-organised table), and your questions linked on dba.stackexchange show that you really know the details of concurrent transactions (what really happens). I think you are right in general, but adding one of these new systems would add many complications to the whole system in complexity, RAM usage, and fsync churn. – Christian Jun 24 '13 at 05:54
  • @Christian Even in a system with one consumer (and even with just one producer too) there will still be concurrency problems with your approach; in particular I'm concerned you'll lose dirty flags when a document is dirtied again just as something finishes processing it. It's all too hand-wavy and ill-defined to be sure, though. I think you are making a mistake by trying to do this yourself. Use an existing solution that's tested and that works. – Craig Ringer Jun 24 '13 at 06:10
  • The race condition concern is spot on and would have caused much trouble, but from the question: "(BTW: There is another queue for the next step, so regarding race conditions the idea should be clean and have no problem)" (a race in step 1 is harmless). I accept that PostgreSQL is not well suited for this usage and that I should be using an MQ solution, but I'm afraid I can't afford that at this time, maybe in a later round of system changes. I also don't know how to synchronize insertion into the queue with other updates in the real database; they surely can't share one transaction/WAL/fsync. – Christian Jun 24 '13 at 07:36
  • @CraigRinger: Do queueing systems support the pattern of "any id number in any number of times, lowest id number out"? – Mike Sherrill 'Cat Recall' Jun 24 '13 at 11:30
  • @MikeSherrill'Catcall' I found that somewhat ill-defined. If the author means "FIFO" then yes. If they really mean "lowest ID first" then I can't say so for sure for any I've worked with... but "lowest ID first" is again a bit of an odd strategy. – Craig Ringer Jun 24 '13 at 11:42