1

I have a table that contains HTML content. This content may contain one or several URLs. I also have a mapping table that contains URLs with their associated rewrite. I need to be able in each HTML content to replace all the URLs with their rewrite when it exists.

The use case (Postgres 9.5):

TABLE some_content (content_id int4, content text)
row1: 1, 'A BA BLAH PIKA',
row2: 2, 'B AB',
row3: 3, 'C PIKA NOTA CA'

TABLE rewrite (rule_id int4, old_string text, new_string text)
row1: 1, 'PIKA', 'CHU',
row2: 2, 'BLAH', 'POM'

The query should ouput the following set:

row1: 1, 'A BA POM CHU'
row2: 2, 'B AB'
row3: 3, 'C CHU NOTA CA'

Adding a new row in the rewrite table like:

row3: 3, 'NOTA', 'ISB'

would then transform the resulting set into (int4, text):

row1: 1, 'A BA POM CHU'
row2: 2, 'B AB'
row3: 3, 'C CHU ISB CA'

Any hint?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
greg
  • 3,354
  • 1
  • 24
  • 35

1 Answers1

1

Each replacement depends on the result of the last. You need a loop of some kind. And you need a deterministic order within replacements. Assuming rule_id in ascending order. And assuming you want to replace any match, not just whole words (easy to adapt).

You could loop in a plpgsql function. Probably faster. See:

Or, for pure SQL, try this recursive CTE:

WITH RECURSIVE cte AS (
   SELECT s.content_id, r.rule_id
        , replace(s.content, r.old_string, r.new_string) AS content
   FROM   some_content s
   CROSS  JOIN (
      SELECT rule_id, old_string, new_string
      FROM   rewrite
      ORDER  BY rule_id  -- order of rows is relevant!
      LIMIT  1
      ) r

   UNION ALL
   SELECT c.content_id, r.rule_id
        , replace(c.content, r.old_string, r.new_string) AS content
   FROM   cte c
        , LATERAL (
      SELECT rule_id, old_string, new_string
      FROM   rewrite
      WHERE  rule_id > c.rule_id
      ORDER  BY rule_id  -- order of rows is relevant!
      LIMIT  1
      ) r
   )
SELECT DISTINCT ON (content_id) content
FROM   cte
ORDER  BY content_id, rule_id DESC;

The LATERAL join to work around "invalid reference to FROM-clause entry for table "c" you would get with a direct subquery referencing the CTE. Related:

Or, using row_number() to produces a serial number without gaps like you commented:

WITH RECURSIVE r AS (
   SELECT old_string, new_string
        , row_number() OVER (ORDER BY rule_id) AS rn  -- your ORDER BY expression?
   FROM   rewrite
   )
 , cte AS (
   SELECT s.content_id, r.rn
        , replace(s.content, r.old_string, r.new_string) AS content
   FROM   some_content s
   JOIN   r ON r.rn = 1

   UNION ALL
   SELECT s.content_id, r.rn
        , replace(s.content, r.old_string, r.new_string) AS content
   FROM   cte s
   JOIN   r ON r.rn = s.rn + 1
   )
SELECT DISTINCT ON (content_id) content
FROM   cte
ORDER  BY content_id, rn DESC;

dbfiddle here

It's often overlooked that plain CTEs can still be added after WITH RECURSIVE:

About DISTINCT ON:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • No worries, I asked for a hint. I went for the recursive CTE as well but found that akward with multiple expressions in a CTE (one for adding the row_number() in rewrite rule to add the asked identifier). Your suggestion to use a loop in a PLPgSQL function is also a good hint. Thank you. – greg Oct 23 '17 at 13:51
  • 1
    Now it works. The `LATERAL` join to work around `"invalid reference to FROM-clause entry for table "c"` – Erwin Brandstetter Oct 23 '17 at 13:53
  • Is that faster to use `WHERE rule_id > c.rule_id … ORDER BY … LIMIT 1` or to directly `INNER JOIN on an incrementing id` ? – greg Oct 23 '17 at 14:03
  • @greg: The latter - if you can guarantee there are no gaps. – Erwin Brandstetter Oct 23 '17 at 14:04
  • Sure, it is obtained through a `row_number()` window function so it should have no gaps. Thank you for the -- as always -- nice and fast reply. – greg Oct 23 '17 at 14:07
  • I added a variant with `row_number()`. – Erwin Brandstetter Oct 23 '17 at 21:56