3

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:

<u><font color="#0000FF"><a href="http://virginialidar.com/index-3.html#.VgLbFPm6e73" target="_blank">VA Lidar</a></font></u> OR <u><font color="#0000FF"><a href="https://inport.nmfs.noaa.gov/inport/item/50122" target="_blank">InPort Metadata</a></font></u>

I would like to update the column by keeping only the html links, and separating them with a coma if there are more than one. Ideally I would do something like this:

UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');

But unfortunately this returns an error in Postgres 10:

ERROR: set-returning functions are not allowed in UPDATE

I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?

Vlad
  • 387
  • 2
  • 11
  • What's supposed to go into `column` in case there is no html link at all? NULL? Empty string? Keep original value? It would be very helpful to provide the table definition as `CREATE TABLE` script to make testing easier and clarify data types and constraints. – Erwin Brandstetter Jun 22 '18 at 23:31

3 Answers3

7

Notes

1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.

2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:

3.
I use a regular expression without lookahead and lookbehind constraints and parentheses instead: href="([^"]+)

See query 1.

This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():

If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the *pattern* contains parenthesized subexpressions, then the result is a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern

And for regexp_matches():

This function returns no rows if there is no match, one row if there is a match and the g flag is not given, or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern, just as described above for regexp_match.

4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). Does not occur with this regexp, every array in the result holds a single element. But future readers shall not be lead into a trap:

When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) that produces a 2-D array - which is only even possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:

However, quoting the manual:

inputs must all have same dimensionality, and cannot be empty or NULL

I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:

  1. Only take the first element with (regexp_matches(...))[1]. See query 2.

  2. Unnest arrays and use string_agg() on base elements. See query 3.

Each approach works here, too.

Query 1

UPDATE tbl t
SET    col = (
   SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
   );

Columns with no match are set to '' (empty string).

Query 2

UPDATE tbl
SET    col = (
   SELECT string_agg(t.arr[1], ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
   );

Columns with no match are set to NULL.

Query 3

UPDATE tbl
SET    col = (
   SELECT string_agg(elem, ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
        , unnest(t.arr) elem
   );

Columns with no match are set to NULL.

db<>fiddle here (with extended test case)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks Erwin, I've had success using your 3 queries, and I indeed prefer avoiding subqueries. May I ask why you use this particular regexp? Im am trying to match only the actual links inside the quotes without the 'href=' part. – Vlad Jun 25 '18 at 13:35
  • My regexp is simpler and shorter (maybe faster, too - didn't test), while doing the same (effectively, for this use case). – Erwin Brandstetter Jun 26 '18 at 01:11
  • I know very little about regex, and your regexp does indeed achieve the same result when applying it to the query. Any idea why the result is then different when testing on this widget? https://regexr.com/3rjkj – Vlad Jun 26 '18 at 13:49
  • @Vlad: It's because the function returns *parenthesized subexpressions*. See the added explanation above. – Erwin Brandstetter Jun 26 '18 at 15:43
1

You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:

update mytable
set column = (
    select array_to_string(array_agg(x), ',')
    from (
        select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
        from t t2
        where t2.id = t.id
    ) dt(x)
)

You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.

mu is too short
  • 426,620
  • 70
  • 833
  • 800
0

Building on the approach of mu is too short with slightly different regex and a COALESCE function to retain values that do not contain href-links:

UPDATE a 
SET    bad_data = COALESCE(
  (SELECT Array_to_string(Array_agg(x), ',') 
   FROM   (SELECT Regexp_matches(a.bad_data, 
                                 '(?<=href=")[^"]+', 'g' 
                                ) AS x 
           FROM   a a2 
           WHERE  a2.id = a.id) AS sub), bad_data
);

SQL Fiddle

wp78de
  • 18,207
  • 7
  • 43
  • 71