4

I've got a table with entries that are similar to "something XXX010101 somethingelse XXX010102".

I'd like to extract the XXX01... parts from them - possibly using some separator if needed - right now I'm able to easily take out the first one using:

select distinct substring(content from 'XXX[0-9]+'), content from data where content ~ 'XXX[0-9]+'

The one idea I have is to create some monster regex to replace everything that's not an XXX substring... but to be true I hoped (and failed to find) for a simpler solution (like a g flag to substring).

When looking around I found that 8.3 introduced a regex_matches function that seems to be what I'd need - are there any alternatives in 8.2? Or an easy way to get it in 8.2?

How would you go about solving this? Or is an upgrade my best bet? ;) (it's a production system so the downtime and some risk of problems after migration is a source of hesitation ).

Thanks in advance.

-- expected output added --

for

"something XXX010101 something else XXX010102"

I'd like to get:

XXX010101
XXX010102

OR (less preferable)

XXX010101,XXX010102 
Evan Carroll
  • 78,363
  • 46
  • 261
  • 468
RnR
  • 2,096
  • 1
  • 15
  • 23
  • 1
    Are you aware that 8.2 will be de-supported soon? –  Dec 02 '11 at 13:24
  • I was not aware - thanks. We never had any issues so support was not really an issue and I plan to update once we get a downtime "either way" for some other reason (datacenter migration etc (it's planned whithin a couple months)) – RnR Dec 02 '11 at 13:29
  • 2
    De-Supported means that there will be no more bug fixes (security or otherwise) beginning next year. See here: http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy –  Dec 02 '11 at 13:32
  • Release policy has been moved to official site: http://www.postgresql.org/support/versioning/ – Erwin Brandstetter Apr 12 '12 at 04:45

3 Answers3

2

I am somewhat hesitant to even post my answer. You really must upgrade. Version 8.2 reaches end of live right now. Follow the link @a_horse_with_no_name posted.

However, the problem got my attention. The following demo should work with PostgreSQL 8.2:

SELECT -- content,
         rtrim(
         regexp_replace(
         regexp_replace(
         replace(
         substring(content
        ,E'(XXX\\d+|XXX\\d+.*XXX\\d+)')  -- kill leading / trailing noise
        ,',',' ')                        -- remove all ","
        ,E'(XXX\\d+)', E'\\1,', 'g')     -- terminate X-strings with ","
        -- now we can use non-greedy terminated with ","
        ,E'(XXX\\d+?,)*?.*?(XXX\\d+?,)', E'\\1\\2', 'g')
        ,',') AS result
FROM    (VALUES
  ('no match')
 ,('XXX010101')
 ,('XXX010102 beginn')
 ,('end XXX010103')
 ,('middle XXX010104 match')
 ,('match middle XXX010105 and end XXX010106')
 ,('XXX010107, match beginn XXX010108 & middle')
 ,('XXX010109 begin and end XXX010110')
 ,('XXX01011 begin XXX010112 middle and end XXX010113')
 ,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
 ,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end')
) data(content)

Result:

                     result
--------------------------------------------------
             -- first line is NULL
 XXX010101
 XXX010102
 XXX010103
 XXX010104
 XXX010105,XXX010106
 XXX010107,XXX010108
 XXX010109,XXX010110
 XXX01011,XXX010112,XXX010113
 XXX01014,XXX010115,XXX010116
 XXX01017,XXX010118,XXX010119,XXX010120,XXX010121

Some explanation:

  • no regex_matches() in version 8.2 as OP already stated
  • but there is regexp_replace() which can use a g flag (replace globally)
  • we cannot mix greedy and non-greedy quantifiers in the same regular expression
  • so I terminate the wanted strings with a ,, after removing all other occurances of , Could be any character not part of the wanted strings, but , can serve as separator in the result.
  • first cut leading and trailing noise
  • then globally replace everything in between wanted strings.
  • to make that work use (XXX\\d+?,)*? to capture any number of leading occurances of wanted string.
  • final rtrim() removes trailing ,

  • In PsotgreSQL 8.3+ you could use regexp_split_to_table() to split the wanted strings into single rows. In 8.2 you'll have to come up with something of your own. I would write a plgpsql function ...

This makes heavy use of a features of PostgreSQL's POSIX Regular Expressions (link to version 8.2!)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
1

How about something like this (assuming the values you're looking for are contained in a separate table)... I'll make no claims as to good performance though...

Select A.text, B.Text2, B.Val 
FROM A 
INNER JOIN B ON B.Text2 LIKE ('%' || A.Text || '%')

let table A contain all possible combinations of XXX010101... that your looking for

text
XXX010101
XXX010102
XXX010103

let table B contain all the search texts you want

text 2                                              val
something XXX010101 something else XXX010102        1
yet another XXX010102 and this XXX010103            2
XXX010105                                           3
XXX010103                                           4

Results:

text            text2                                           VAL
XXX010101   something XXX010101 something else XXX010102    1
XXX010102   something XXX010101 something else XXX010102    1
XXX010102   yet another XXX010102 and this XXX010103        2
XXX010103   yet another XXX010102 and this XXX010103        2
XXX010103   XXX010103                                       4

--------WRONG but misunderstood the question...

What's wrong with the replace function? http://www.postgresql.org/docs/8.2/interactive/functions-string.html

replace( 'abcdefabcdef', 'cd', '')

so the characters are set to empty string.

xQbert
  • 34,733
  • 2
  • 41
  • 62
  • To be true it's mostly that the regex would be hard to read and "logically reversed" and I'd need to parse the output either way on the client side - and in that case I can do this on the client side altogether and it'll be much more readable too I think? (while ($content =~ m/XXX[0-9]{6}/g) ... ). – RnR Dec 02 '11 at 12:51
  • 1
    I guess I don't understand the question then. Can you update the question to include Sample input/output that you want? – xQbert Dec 02 '11 at 13:05
  • Yep, I totally miss understood the question :D – xQbert Dec 02 '11 at 13:50
  • Is there by chance a table which contains all possible values of the XXX010101, XXX010102... – xQbert Dec 02 '11 at 14:03
0

The fastest method is to use plperlu which works with 8.2.

CREATE LANGUAGE plperl

CREATE FUNCTION get_things(inputStr text)
RETURNS SETOF text
AS $BODY$
  return \@{[ $_[0] =~ m/(XXX\d{6})/g ]} 
$BODY$
LANGUAGE plperl
IMMUTABLE;

SELECT get_things(x)
FROM ( VALUES
  ('XXX010101 somethingelse XXX010102')
) AS t(x);

 get_things 
------------
 XXX010101
 XXX010102
(2 rows)

It also works with new versions of PostgreSQL.

Evan Carroll
  • 78,363
  • 46
  • 261
  • 468