
Here's my situation. I have a table with a bunch of URLs and the crawl dates associated with them. When my program processes a URL, I want to INSERT a new row with a crawl date. If the URL already exists, I want to update the crawl date to the current datetime. With MS SQL or Oracle I'd probably use a MERGE command for this. With MySQL I'd probably use the ON DUPLICATE KEY UPDATE syntax.
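
For reference, the MySQL-style statement I have in mind would look roughly like this (just a sketch, using my table's columns):

-- MySQL only: insert, or update the existing row on a duplicate key
INSERT INTO Pages (Url, Html, LastCrawled)
VALUES ('http://www.foo.com', '<html>...</html>', NOW())
ON DUPLICATE KEY UPDATE LastCrawled = NOW(), Html = VALUES(Html);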

I could do multiple queries in my program, which may or may not be thread-safe. I could write a SQL function with IF...ELSE logic. However, for the sake of trying out Postgres features I've never used before, I'm thinking about creating an INSERT rule - something like this:

CREATE RULE Pages_Upsert AS ON INSERT TO Pages
  WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)
  DO INSTEAD
     UPDATE Pages SET LastCrawled = NOW(), Html = NEW.Html WHERE Url = NEW.Url;

This seems to actually work great. It probably loses some points from a "code readability" standpoint, since someone looking at my code for the first time would have to magically know about this rule, but I guess that could be solved with good code comments and documentation.

Are there any other drawbacks to this idea, or maybe a "your idea sucks, you should do it /this/ way instead" comment? I'm on PG 9.0 if that matters.

UPDATE: Query plan since someone wanted it :)

"Insert  (cost=2.79..2.81 rows=1 width=0)"
"  InitPlan 1 (returns $0)"
"    ->  Seq Scan on pages p  (cost=0.00..2.79 rows=1 width=0)"
"          Filter: ('http://www.foo.com'::text = lower((url)::text))"
"  ->  Result  (cost=0.00..0.01 rows=1 width=0)"
"        One-Time Filter: ($0 IS NOT TRUE)"
""
"Update  (cost=2.79..5.46 rows=1 width=111)"
"  InitPlan 1 (returns $0)"
"    ->  Seq Scan on pages p  (cost=0.00..2.79 rows=1 width=0)"
"          Filter: ('http://www.foo.com'::text = lower((url)::text))"
"  ->  Result  (cost=0.00..2.67 rows=1 width=111)"
"        One-Time Filter: $0"
"        ->  Seq Scan on pages  (cost=0.00..2.66 rows=1 width=111)"
"              Filter: ((url)::text = 'http://www.foo.com'::text)"
Mike Christensen
  • Please add the resulting query plan (for an INSERT action on the target table) to the post. – wildplasser Sep 18 '11 at 19:58
  • BTW, there is actually an INDEX on lower(url) on pages - I think it's doing a sequential scan because there's only 4 rows in the table at the moment, and the planner thinks it's faster to just look at 4 rows than to hit an index. Just a guess. – Mike Christensen Sep 18 '11 at 20:22
  • Thanks. I got about the same. The update part always takes place. If you do an actual INSERT INTO pages(...) VALUES (...) the value for lastcrawled is overwritten by now(), even if the key was not present before the insert. – wildplasser Sep 18 '11 at 20:25

5 Answers


OK, I managed to create a test case. The result is that the update part is always executed, even on a fresh insert. COPY seems to bypass the rule system. [For clarity I have put this into a separate reply.]

DROP TABLE pages CASCADE;
CREATE TABLE pages
    ( url VARCHAR NOT NULL  PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );

CREATE RULE Pages_Upsert AS ON INSERT TO pages
  WHERE EXISTS (SELECT 1 from pages P where NEW.url = P.url)
     DO INSTEAD (
     UPDATE pages SET html=new.html , last = NOW() WHERE url = NEW.url
    );

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00'::timestamp );
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00'::timestamp );

INSERT INTO pages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM pages pp;

COPY pages(url,html,last) FROM STDIN;
www.example.com://pageX     stdin   2000-09-18 23:30:00
\.

SELECT * FROM pages;

The result:

              url              |    html    |            last            
-------------------------------+------------+----------------------------
 www.example.com://page1       | meuk1      | 2001-09-18 23:30:00
 www.example.com://page2       | meuk2      | 2011-09-18 23:48:30.775373
 www.example.com://page3       | meuk3      | 2011-09-18 23:48:30.783758
 www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
 www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
 www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
 www.example.com://pageX       | stdin      | 2000-09-18 23:30:00
 (7 rows)

UPDATE: Just to prove it can be done:

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE VIEW vpages AS (SELECT * from pages);

CREATE RULE Pages_Upsert AS ON INSERT TO vpages
  DO INSTEAD (
     UPDATE pages p0
     SET html=NEW.html , last = NOW() WHERE p0.url = NEW.url
    ;
     INSERT INTO pages (url,html,last)
    SELECT NEW.url, NEW.html, NEW.last
        WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
    );

CREATE RULE Pages_Indate AS ON UPDATE TO vpages
  DO INSTEAD (
     INSERT INTO pages (url,html,last)
    SELECT NEW.url, NEW.html, NEW.last
        WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = OLD.url)
        ;
     UPDATE pages p0
     SET html=NEW.html , last = NEW.last WHERE p0.url = NEW.url
        ;
    );

INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00'::timestamp );
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00'::timestamp );

INSERT INTO vpages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM vpages pp;
UPDATE vpages SET last = last + interval '-10 years' WHERE url = 'www.example.com://page1' ;

-- Copy does NOT work on views
-- COPY vpages(url,html,last) FROM STDIN;
-- www.example.com://pageX    stdin    2000-09-18 23:30:00
-- \.

SELECT * FROM vpages;

Result:

INSERT 0 1
INSERT 0 1
INSERT 0 3
UPDATE 1
              url              |    html    |        last         
-------------------------------+------------+---------------------
 www.example.com://page2       | meuk2      | 2002-09-18 23:30:00
 www.example.com://page3       | meuk3      | 2003-09-18 23:30:00
 www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
 www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
 www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
 www.example.com://page1       | meuk1      | 1991-09-18 23:30:00
(6 rows)

The view is necessary to keep the rewrite system from going into recursion. Construction of a DELETE rule is left as an exercise for the reader.
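
For illustration, here is roughly why the view is needed (a sketch; with both actions defined directly on the table, the INSERT in the rule body matches the same rule again, and PostgreSQL aborts the query with an "infinite recursion detected in rules" error):

-- Broken variant: a rule ON INSERT TO pages whose body INSERTs into pages.
-- The rewriter tries to apply the rule to its own action and gives up.
CREATE RULE pages_upsert_bad AS ON INSERT TO pages
  DO INSTEAD (
     UPDATE pages SET html = NEW.html, last = NOW() WHERE url = NEW.url;
     INSERT INTO pages(url, html, last)
     SELECT NEW.url, NEW.html, NEW.last
      WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
    );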

wildplasser
  • A lot of good feedback on this thread, but marking this one as the answer because you put a lot of work into it. – Mike Christensen Sep 19 '11 at 21:19
  • Thanks. I recently investigated this stuff, so I had most of it handy. The conclusion is: test it. Test all of the corner cases. The "INSERT INTO tab ... SELECT ... FROM tab" case can be a real showstopper if you seduce the rewriter into using the wrong rangetable entries and/or merging subqueries or clauses. EXPLAIN is your friend. – wildplasser Sep 19 '11 at 21:47

Some good points from someone who should know, or who at least sits very near to someone who does ;-)

What are PostgreSQL RULEs good for?

Short story:

  • Do the rules work well with SERIAL and BIGSERIAL?
  • Do the rules work well with the RETURNING clauses of INSERT and UPDATE?
  • Do the rules work well with stuff like random()?

All of these issues boil down to the fact that the rule system is not row-driven: it transforms your statements in ways you would never imagine.
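
A minimal demonstration of the random() problem (a sketch with made-up table and rule names):

CREATE TABLE t (val float8);
CREATE TABLE t_log (val float8);

CREATE RULE t_log_rule AS ON INSERT TO t
  DO ALSO INSERT INTO t_log VALUES (NEW.val);

-- The rewriter substitutes the expression NEW.val, not its value:
-- random() is evaluated once for t and once more for t_log, so a
-- single INSERT can leave different values in the two tables.
INSERT INTO t VALUES (random());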

Do yourself and your teammates a favour and stop using rules for things like that.

Edit: Your problem is well discussed in the PostgreSQL community. Search keywords are: MERGE, UPSERT.

A.H.

I don't know if this gets too subjective, but here's what I think about your solution: it's all about semantics. When I do an insert, I expect an insert, not some fancy logic that maybe does an insert and maybe doesn't. Indeed, that's what functions are for.

At first I'd try checking for the URL in your program and then choosing whether to insert or update. If that turned out to be too slow, I'd use a function. If you name it something like insert_or_update_url, you automatically get some documentation for free. The rewrite rule requires implicit knowledge, which I generally try to avoid.
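
A sketch of what such a function could look like (hypothetical, assuming the question's column names; note that without a retry loop it is still not fully safe against concurrent inserts):

CREATE FUNCTION insert_or_update_url(p_url varchar, p_html varchar)
RETURNS void AS
$$
BEGIN
    -- Try the update first; FOUND tells us whether any row matched.
    UPDATE pages SET html = p_html, lastcrawled = now() WHERE url = p_url;
    IF NOT FOUND THEN
        INSERT INTO pages(url, html, lastcrawled)
        VALUES (p_url, p_html, now());
    END IF;
END;
$$ LANGUAGE plpgsql;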

On the plus side for the function approach: if someone copies the data but forgets the rules and functions, your rule-based solution might break silently (though that may depend on other constraints), whereas a missing function goes down screaming. Don't get me wrong, I think your solution is very creative and smart. Just a bit too obscure for my taste.

musiKk
  • Yea I kinda agree. Doing a check to see if the row exists first will be too slow and isn't transaction safe if many instances of my program are running at once. A PG function might fix this and be faster. The table schema prohibits the same URL from existing twice so copying the data isn't an issue, plus this data is transient in nature so not super important. Maybe one of these days, PG will have an UPSERT ability and this won't be an issue. – Mike Christensen Sep 18 '11 at 19:47

There's an example of implementing upsert/merge using a simple function in the Postgres documentation.
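
From memory, the documentation's example is a retry loop around UPDATE-then-INSERT, roughly like this (see the docs for the canonical version):

CREATE TABLE db (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- First try to update the key.
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- Not there, so try to insert the key. If someone else inserts
        -- the same key concurrently, we get a unique-key failure.
        BEGIN
            INSERT INTO db(a, b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing, and loop to try the UPDATE again.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;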

Never use rules — they're evil.

Tometzky
  • I agree with "don't use rules" as a rule of thumb. But they are not evil per se; they are just tricky, and hard to get right. The devil is in the details. It is a good way to learn to understand query plans though ;-] – wildplasser Sep 18 '11 at 22:14
  • @wildplasser - They're _GOTO_-level evil. If you even somehow do them right 95% of other developers won't. Almost everything that is implemented using a rule can be implemented much clearer and easier to understand and more portable using for example a trigger or simply a function. EVIL I say. – Tometzky Sep 19 '11 at 05:56
  • Make that 99%. The problem is that most developers think in terms of "row at a time" processing, which matches nicely with triggers. Considering rules as "triggers on steroids" is a result of this broken paradigm. To extend your _GOTO_ metaphor: the rule qualification in the OP is comparable to "void main()" in a C program. It is just plain wrong, even if it happens to work in some particular case. The fact that I got voted down for mentioning it supports your 95% rule. I rest my case. – wildplasser Sep 19 '11 at 09:44

You cannot refer to tables other than OLD and NEW in the rule qualification; do that in the rule body instead. This is because the rule is just a way to inform the rewrite system about which transformations it should and should not perform. Rules are not triggers executing for every row; instead, they give the query planner a fine massage and ask it nicely to rewrite the plan. From the docs:

What is a rule qualification? It is a restriction that tells when the actions of the rule should be done and when not. This qualification can only reference the pseudorelations NEW and/or OLD, which basically represent the relation that was given as object (but with a special meaning).
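
For illustration, a qualification that stays within this restriction references only NEW or OLD, e.g. (a made-up example):

-- Legal qualification: references only the NEW pseudorelation.
CREATE RULE pages_skip_empty AS ON INSERT TO pages
  WHERE NEW.html IS NULL
  DO INSTEAD NOTHING;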

wildplasser
  • your "WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)" qualification is not valid. Have you tried it? – wildplasser Sep 18 '11 at 19:47
  • I just tested it. It is accepted and seems to work. Sorry. But according to the documentation it should not work. It could be that, although it works here, it generates wrong plans for more complex queries (such as INSERT INTO ... SELECT FROM ..., or self-joins) – wildplasser Sep 18 '11 at 20:11
  • I think the general consensus is "Yes, this will work at least in my specific case" but "It's hacky, you're playing with fire, it's hard to read, and rules are bad mmm kay".. It was a good excuse to play around with a PG feature I hadn't used before, but I don't think it's the right solution for the long term.. – Mike Christensen Sep 18 '11 at 20:15