5

I need to build a query in PostgreSQL and am required to find all text entries that contain a 6 digit number (e.g. 000999, 019290, 998981, 234567, etc). The problem is that the number is not necessary at the begining of the string or at its end.

I tried and didn't work:

  • [0-9]{6} - returns part of a number with more than 6 digits
  • (?:(?<!\d)\d{6}(?!\d)) - postgresql does not know about lookbehind
  • [^0-9][0-9]{6}[^0-9] and variations on it, but to no avail.

Building my own Perl/C function is not really an option as I do not have the skills required. Any idea what regexp could be used or other tricks that elude me at the moment?

EDIT

Input samples:

  • aa 0011527 /CASA -> should return NOTHING
  • aa 001152/CASA -> should return 001152
  • aa001152/CASA -> should return 001152
  • aa0011527/CASA -> should return NOTHING
  • aa001152 /CASA -> should return 001152
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
CristisS
  • 1,103
  • 1
  • 12
  • 31

2 Answers2

6

If PostgreSQL supports word boundaries, use \b:

\b(\d{6})\b

Edit:

\b in PostgreSQL means backspace, so it's not a word boundary.

http://www.postgresql.org/docs/8.3/interactive/functions-matching.html#FUNCTIONS-POSIX-REGEXP however, will explain you that you can use \y as a word boundary, as it means matches only at the beginning or end of a word, so

\y(\d{6})\y

should work.

\m(\d{6})\M

should also work.

Full list of word matches in PostgreSQL regex:

Escape  Description
\A      matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^)
\m      matches only at the beginning of a word
\M      matches only at the end of a word
\y      matches only at the beginning or end of a word
\Y      matches only at a point that is not the beginning or end of a word
\Z      matches only at the end of the string (see Section 9.7.3.5 for how this differs from $)

New edit:

Based on your edit, you should be able to do this:

(^|[^\d])(\d+)([^\d]|$)
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
h2ooooooo
  • 39,111
  • 8
  • 68
  • 102
  • I think the requirement is that string may contain other characters also. – Naveed S Jan 24 '13 at 14:15
  • @NaveedS Since we didn't use `^` or `$`, it will match at any place in a string. – h2ooooooo Jan 24 '13 at 14:18
  • @NaveedS Yes, the string may contain other strings. Also, PostgreSQL has following boundaries: http://stackoverflow.com/questions/3825676/postgresql-regex-word-boundaries.
    I tried your solution, but didn't work. I added some samples in my question.
    – CristisS Jan 24 '13 at 14:20
  • 1
    @h2ooooooo My final query is in my answer, but it would've not been possible without your suggestions. Marked as answer and thanks! :) – CristisS Jan 24 '13 at 15:11
0

Using what @h2ooooooo proposed I managed to create the following query:

SELECT cleantwo."ID",cleantwo."Name",cleantwo."Code"
FROM
(
SELECT cleanone."ID",cleanone."Name",unnest(cleanone."Code") as "Code" -- 3. unnest all the entries received using regexp_matches (get all combinations)
FROM 
(
SELECT sd."ID", sd."Name", regexp_matches(sd."Name", '(^|[^\d])(\d+)([^\d]|$)')
    as "Code"
FROM "T_SOME_DATA" sd
WHERE substring(sd."Name" from 1 for 15) ~('(^|[^\d])(\d+)([^\d]|$)') -- 1. get all data possible
) as cleanone
WHERE cleanone."Code" IS NOT NULL -- 2. get data where code IS NOT NULL (remove useless entries)
) as cleantwo
WHERE length(cleantwo."Code")=6 -- 4. get only the combinations relevant to my initial requirement (codes with length 6)<br/>

It took me a lot of time to find this so I hope it helps someone else in the same situation. Good luck!

CristisS
  • 1,103
  • 1
  • 12
  • 31