
To change tag pairs around text, this Postgres SELECT expression works for me:

select regexp_replace('The corpse of the huge <i>fin whale</i> created a spectacle on <span class="day">Friday</span> as <i>people</i> wandered the beach to observe it.',
                      '(<i>)([^/]+)(</i>)',
                      '<em>\2</em>',
                      'g');

I worry about excessive greediness in capture group two, though. My first try for group two was (.+), and that failed. The ([^/]+) works better, but I wonder whether it is good enough.

Can anything be done to make that SELECT statement more robust?

kgrittn
Paulb
  • You didn't mention your version, but if you are not already on the latest minor (bug-fix) release of whichever major release you are running, I strongly recommend that you apply the latest bug-fix release. There have been bug fixes in the regex code. See: http://www.postgresql.org/support/versioning/ – kgrittn Dec 08 '12 at 15:53

1 Answer


There are generally two possibilities (and both seem to be supported by PostgreSQL's regex engine).

  1. Make the repetition ungreedy:

    <i>(.+?)</i>
    
  2. Use a negative lookahead to ensure that you consume anything except for </i>:

    <i>((?:(?!</i>).)+)</i>
    

In both cases, I removed the unnecessary captures. You can use \1 now in your replacement string.
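Put together with the statement from the question, the two patterns might be used as follows. This is a minimal sketch with a shortened sample string, and it assumes standard_conforming_strings is on (the default since PostgreSQL 9.1), so that \1 needs no extra escaping:

```sql
-- Option 1: ungreedy repetition; \1 now refers to the text between the tags.
select regexp_replace('The <i>fin whale</i> drew <i>people</i> to the beach.',
                      '<i>(.+?)</i>',
                      '<em>\1</em>',
                      'g');

-- Option 2: negative lookahead; a character is consumed only if no </i> starts at it.
select regexp_replace('The <i>fin whale</i> drew <i>people</i> to the beach.',
                      '<i>((?:(?!</i>).)+)</i>',
                      '<em>\1</em>',
                      'g');
```

Both statements should turn each <i>...</i> pair into its own <em>...</em> pair instead of treating everything from the first <i> to the last </i> as one match.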

These two should be equivalent in what they do, though their performance may differ. The former needs backtracking, while the latter has to attempt the lookahead at every single position. Which one is faster would have to be profiled and might even depend on the individual input strings. Note that, since the second pattern uses a greedy repetition that stops right before </i>, the captured group would be the same even without the trailing </i>. For a replacement, however, keep the trailing </i> in the pattern so that the closing tag is consumed and replaced as well.
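One way to profile the two patterns is sketched here. The input string, the row count of 10000, and the choice of EXPLAIN ANALYZE are assumptions made for illustration; real data may behave differently. The row number is appended to the input so that the planner cannot evaluate the whole call once as a constant before execution:

```sql
-- Timing sketch for the ungreedy pattern (input and row count are made up).
explain analyze
select count(regexp_replace(repeat('<i>fin whale</i> and some filler text ', 50) || g::text,
                            '<i>(.+?)</i>',
                            '<em>\1</em>',
                            'g'))
from generate_series(1, 10000) as g;

-- Same input, lookahead pattern.
explain analyze
select count(regexp_replace(repeat('<i>fin whale</i> and some filler text ', 50) || g::text,
                            '<i>((?:(?!</i>).)+)</i>',
                            '<em>\1</em>',
                            'g'))
from generate_series(1, 10000) as g;
```

Comparing the execution times reported for the two statements over several runs gives a rough idea of which pattern suits a given kind of input.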

Your current approach is already robust in the sense that it can never go past a </i>. At the same time, it does not allow nested tags (the repetition cannot get past the closing tag of the nested pair), and because [^/] excludes every slash, it also fails whenever the text between the tags contains a / of its own.
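The limits of the ([^/]+) pattern, and of regexes with nested tags in general, can be seen with two made-up inputs. This is only a sketch, and the sample strings are invented for illustration:

```sql
-- A slash inside the text: [^/]+ stops at "and/", so nothing matches
-- and the input comes back unchanged.
select regexp_replace('<i>and/or</i>',
                      '(<i>)([^/]+)(</i>)',
                      '<em>\2</em>',
                      'g');

-- Nested tags: the ungreedy pattern pairs the outer <i> with the inner </i>,
-- which leaves mismatched tags in the result.
select regexp_replace('<i>a <i>b</i> c</i>',
                      '<i>(.+?)</i>',
                      '<em>\1</em>',
                      'g');
```

Neither pattern pairs nested tags correctly, because nothing in the pattern keeps track of how deep the nesting is.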

However, note that regular expressions are not really up to the job of parsing or manipulating HTML. What if there are extraneous spaces in your tags? What if the opening tag has attributes? What if one or both of the tags occur inside attribute values or comments?
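A quick way to see these cases is to run the pattern over a few awkward inputs side by side. The sample rows below are invented for illustration and are not from the original data:

```sql
-- Each row is one made-up edge case; compare the "before" and "after" columns.
select s as before,
       regexp_replace(s, '<i>(.+?)</i>', '<em>\1</em>', 'g') as after
from (values
        ('<i class="sp">fin whale</i>'),  -- attribute on the opening tag
        ('<i >fin whale</i >'),           -- extra spaces inside the tags
        ('<!-- <i>old</i> -->')           -- tag pair inside a comment
     ) as t(s);
```

The first two rows contain no literal <i>, so they should come back untouched, while the pair inside the comment is rewritten even though it is not live markup.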

Community
Martin Ender
  • I had heard about the futility of parsing XML and thought that couldn't happen to me on my project. Well, it has happened to me and driven me to the bottle. All my data is in Postgres. I wanted to manipulate it entirely inside Postgres, but now see that I cannot. I don't know what parser to use or how to merge it into my workflow. – Paulb Dec 08 '12 at 18:00
  • @Paulb - you can **always** write a function in PL/pgSQL to do anything you can in any other language! – Vérace Jan 30 '22 at 17:44