Get all instances of a substring in a string in PostgreSQL?

Question

I've got a table with entries that are similar to "something XXX010101 somethingelse XXX010102".

I'd like to extract the XXX01... parts from them - possibly using some separator if needed - right now I'm able to easily take out the first one using:

select distinct substring(content from 'XXX[0-9]+'), content from data where content ~ 'XXX[0-9]+'

The one idea I have is to create some monster regex to replace everything that's not an XXX substring... but to be true I hoped (and failed to find) for a simpler solution (like a g flag to substring).

When looking around I found that 8.3 introduced a regex_matches function that seems to be what I'd need - are there any alternatives in 8.2? Or an easy way to get it in 8.2?

How would you go about solving this? Or is an upgrade my best bet? ;) (it's a production system so the downtime and some risk of problems after migration is a source of hesitation ).

Thanks in advance.

-- expected output added --

for

"something XXX010101 something else XXX010102"

I'd like to get:

XXX010101
XXX010102

OR (less preferable)

XXX010101,XXX010102

I was not aware - thanks. We never had any issues so support was not really an issue and I plan to update once we get a downtime "either way" for some other reason (datacenter migration etc (it's planned whithin a couple months)) — RnR, Dec 02 '11 at 13:29
De-Supported means that there will be no more bug fixes (security or otherwise) beginning next year. See here: http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy — , Dec 02 '11 at 13:32
Release policy has been moved to official site: http://www.postgresql.org/support/versioning/ — Erwin Brandstetter, Apr 12 '12 at 04:45

Erwin Brandstetter · Answer 1 · 2011-12-02T21:59:13.713

I am somewhat hesitant to even post my answer. You really must upgrade. Version 8.2 reaches end of live right now. Follow the link @a_horse_with_no_name posted.

However, the problem got my attention. The following demo should work with PostgreSQL 8.2:

SELECT -- content,
         rtrim(
         regexp_replace(
         regexp_replace(
         replace(
         substring(content
        ,E'(XXX\\d+|XXX\\d+.*XXX\\d+)')  -- kill leading / trailing noise
        ,',',' ')                        -- remove all ","
        ,E'(XXX\\d+)', E'\\1,', 'g')     -- terminate X-strings with ","
        -- now we can use non-greedy terminated with ","
        ,E'(XXX\\d+?,)*?.*?(XXX\\d+?,)', E'\\1\\2', 'g')
        ,',') AS result
FROM    (VALUES
  ('no match')
 ,('XXX010101')
 ,('XXX010102 beginn')
 ,('end XXX010103')
 ,('middle XXX010104 match')
 ,('match middle XXX010105 and end XXX010106')
 ,('XXX010107, match beginn XXX010108 & middle')
 ,('XXX010109 begin and end XXX010110')
 ,('XXX01011 begin XXX010112 middle and end XXX010113')
 ,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
 ,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end')
) data(content)

Result:

                     result
--------------------------------------------------
             -- first line is NULL
 XXX010101
 XXX010102
 XXX010103
 XXX010104
 XXX010105,XXX010106
 XXX010107,XXX010108
 XXX010109,XXX010110
 XXX01011,XXX010112,XXX010113
 XXX01014,XXX010115,XXX010116
 XXX01017,XXX010118,XXX010119,XXX010120,XXX010121

Some explanation:

no regex_matches() in version 8.2 as OP already stated
but there is regexp_replace() which can use a g flag (replace globally)
we cannot mix greedy and non-greedy quantifiers in the same regular expression
so I terminate the wanted strings with a ,, after removing all other occurances of , Could be any character not part of the wanted strings, but , can serve as separator in the result.
first cut leading and trailing noise
then globally replace everything in between wanted strings.
to make that work use (XXX\\d+?,)*? to capture any number of leading occurances of wanted string.
final rtrim() removes trailing ,
In PsotgreSQL 8.3+ you could use regexp_split_to_table() to split the wanted strings into single rows. In 8.2 you'll have to come up with something of your own. I would write a plgpsql function ...

This makes heavy use of a features of PostgreSQL's POSIX Regular Expressions (link to version 8.2!)

xQbert · Answer 2 · 2011-12-02T14:18:04.953

How about something like this (assuming the values you're looking for are contained in a separate table)... I'll make no claims as to good performance though...

Select A.text, B.Text2, B.Val 
FROM A 
INNER JOIN B ON B.Text2 LIKE ('%' || A.Text || '%')

let table A contain all possible combinations of XXX010101... that your looking for

text
XXX010101
XXX010102
XXX010103

let table B contain all the search texts you want

text 2                                              val
something XXX010101 something else XXX010102        1
yet another XXX010102 and this XXX010103            2
XXX010105                                           3
XXX010103                                           4

Results:

text            text2                                           VAL
XXX010101   something XXX010101 something else XXX010102    1
XXX010102   something XXX010101 something else XXX010102    1
XXX010102   yet another XXX010102 and this XXX010103        2
XXX010103   yet another XXX010102 and this XXX010103        2
XXX010103   XXX010103                                       4

--------WRONG but misunderstood the question...

What's wrong with the replace function? http://www.postgresql.org/docs/8.2/interactive/functions-string.html

replace( 'abcdefabcdef', 'cd', '')

so the characters are set to empty string.

To be true it's mostly that the regex would be hard to read and "logically reversed" and I'd need to parse the output either way on the client side - and in that case I can do this on the client side altogether and it'll be much more readable too I think? (while ($content =~ m/XXX[0-9]{6}/g) ... ). — RnR, Dec 02 '11 at 12:51
I guess I don't understand the question then. Can you update the question to include Sample input/output that you want? — xQbert, Dec 02 '11 at 13:05
Is there by chance a table which contains all possible values of the XXX010101, XXX010102... — xQbert, Dec 02 '11 at 14:03

Evan Carroll · Answer 3 · 2017-03-10T06:33:34.807

The fastest method is to use plperlu which works with 8.2.

CREATE LANGUAGE plperl

CREATE FUNCTION get_things(inputStr text)
RETURNS SETOF text
AS $BODY$
  return \@{[ $_[0] =~ m/(XXX\d{6})/g ]} 
$BODY$
LANGUAGE plperl
IMMUTABLE;

SELECT get_things(x)
FROM ( VALUES
  ('XXX010101 somethingelse XXX010102')
) AS t(x);

 get_things 
------------
 XXX010101
 XXX010102
(2 rows)

It also works with new versions of PostgreSQL.

Get all instances of a substring in a string in PostgreSQL?

3 Answers3

Some explanation: