1

I'm getting familiar with postgres sql, but having some trouble with pattern matching. I read the documentation and looked through other questions, but couldn't solve this on my own.

I have a field with lots of text data, in the middle of it, numbers with this pattern:

"2021-1234567" (four digits + - + seven digits)

Problem is, it can have other number sequences. Like this:

"Project number 12345678912345 with id 2020-2583697 1456" (in this case, i need to extract 2020-2583697)

In some cases it may be just eleven digits, like this:

"Project 12345678912345 sequence 20202583697 1456" (in this case i need to extract 20202583697)

At first i tried to extract only the numbers (the text is mostly user input) with:

SELECT
   SUBSTRING("my_field", '^[0-9]+$' )
FROM 
    my_table

That didn't help at all... Can anyone help me?

Luan
  • 79
  • 6
  • Does the piece of text your are looking for always start with `2020` ? – Luuk Aug 16 '21 at 16:45
  • It usually stars with some year, but not only 2020 (i have 2016, 2018, 2020, 2021 and many others) – Luan Aug 16 '21 at 16:48
  • 1
    Or, `select substring('your field','20[12][0-9]-[0-9]+$') from mytable ` will match all values starting with a values between `2010` and `2029`. – Luuk Aug 16 '21 at 16:49
  • 2
    Have a look at [regex101.com](https://regex101.com). There you can play with regular expressions and get also an explanation. – Steeeve Aug 16 '21 at 16:58
  • @Luuk I've noticed when the numbers are atached to text this doens't work (like in project2020-1234567abc) Is this expected? – Luan Aug 16 '21 at 17:07
  • @Steeeve i am looking to learn more about pattern matching in sql, but i apreciate the help, will look into it! – Luan Aug 16 '21 at 17:09
  • 1
    @Luan the patterns here, also yours `^[0-9]+$` are regular expressions ;) Just paste it into the regex field in regex101 and you get the explanation on the right side what it searches. – Steeeve Aug 16 '21 at 17:19
  • 1
    @Luan: The `$` at the end means that you will look for the end of the string. see (for an overview of those special things): [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – Luuk Aug 16 '21 at 17:19

2 Answers2

1

This appears to do what you want:

select substring(str, '[0-9]{4}-?[0-9]{7}')
from (values ('asfasdf 2020-2583697 qererf i0iu0 1234234'),
             ('asfasdf 20202583697 qererf i0iu0 1234234')
             ) v(str)

It searches for 4 digits followed by an optional hyphen followed by 7 digits.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Or this, as I could not manage to force checking for blanks around the pattern without returning those blanks otherwise:

WITH 
indata(s) AS (
            SELECT 'Project number 12345678912345 with id 2020-2583697 1456'
  UNION ALL SELECT 'Project 12345678912345 sequence 20202583697 1456'
)
SELECT
  REGEXP_REPLACE(s,'^.* (\d{4}-?\d{7}) .*$','\1') AS found_token
, s
FROM indata;
 found_token  |                            s                            
--------------+---------------------------------------------------------
 2020-2583697 | Project number 12345678912345 with id 2020-2583697 1456
 20202583697  | Project 12345678912345 sequence 20202583697 1456
(2 rows)

The pattern used - REGEXP_REPLACE(s,'^.* (\d{4}-?\d{7}) .*$','\1') - means: replace ^.* the beginning of the string and any number of any characters, followed by a blank; then (\d{4}-?\d{7}) four digits, zero or one dash - -?, and seven digits - and the parentheses around it mean: remember this as the first group; finally: .*$ a blank, then any number of any characters till the end of the string - with group 1: \1 .

marcothesane
  • 6,192
  • 1
  • 11
  • 21