How do I extract a number followed by a specific string in PostgreSQL?

Question

I have the following table:

CREATE TABLE test_regex (
    drug TEXT
);

INSERT INTO test_regex
VALUES
    ('DRUGA 200 MG'),
    ('DRUGB 150 MCG'),
    ('DRUGC 1.5 GM BOX'),
    ('DRUGD 27.2 MG/5 ML VIAL')
;

I want to extract the following:

200 MG
150 MCG
1.5 GM
27.2 MG

What I have tried so far:

SELECT
    substring(drug, '[0-9]*\.?[0-9]* MG|GM|MCG')
FROM
    test_regex
;

Which results in:

 200 MG
 MCG
 GM
 27.2 MG

I've also tried:

SELECT
    substring(drug, '[0-9]*\.?[0-9]* (MG|GM|MCG)')
FROM
    test_regex
;

With this result:

 MG
 MCG
 GM
 MG

I think the problem lies in how the trailing (MG|GM|MCG) group is handled but I couldn't find what I was looking for in the PostgreSQL Docs. I was expecting to get the first number section, then a space, then either MG, GM or MCG. I think its getting grouped as the number expression with MG, then either GM or MCG.

A [*non-capture group*](https://stackoverflow.com/a/3513858/712526) should do the trick, although I'm unsure how to write these with Postgres. Unlike a capture group (which is the default with parentheses), it does not change the matching semantics. Having said that, there's probably also a way to get your capture group to work, by somehow selecting which matching group you want to have as the result. (Typically, match group 0 is the full match, while match group one is the first capture group, i.e. the one you're seeing as your result.) — jpaugh, Jul 25 '18 at 23:18

Wiktor Stribiżew · Accepted Answer · 2018-07-25T23:30:23.563

3

The main idea is that you need to group alternatives that should match at one and the same location in a string. Also, I suggest using word boundaries to match the strings as whole words.

Also, note that substring returns only that part of a match that is captured by the capturing group if there is any:

if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.

So, the grouping construct you may use is a non-capturing group, (?:...|...).

You may use

substring(drug, '\m[0-9]*\.?[0-9]+\s*(?:MG|GM|MCG)\M')

See the online demo.

Pattern details

\m - start of a word
[0-9]* - zero or more digits
\.? - an optional .
[0-9]+ - 1+ digits
\s* - 0+ whitespaces
(?:MG|GM|MCG) - either MG, orGM or MCG (you may write it as (?:MC?G|GM) to make it more efficient)
\M - end of word.

edited Jul 25 '18 at 23:30

answered Jul 25 '18 at 23:24

Wiktor Stribiżew

607,720
39
448
563

The `?:` to make the group a non-capture group did the trick. Could you explain the reason for the `\m` start of word and `\M` end of word? Thanks again! – Gregory Arenius Jul 25 '18 at 23:30
1

@GregoryArenius `\m...\M` = `\y...\y` (in other regex flavors, `\b...\b`) are used to match as whole words (it won't match `5MG` in `5MGGG`. – Wiktor Stribiżew Jul 25 '18 at 23:31
1

@ Wiktor Stribiżew Thanks one more time for adding that last bit from the docs on the substring function and how it deals with capturing groups. That really makes clear to me what was happening. – Gregory Arenius Jul 25 '18 at 23:41

How do I extract a number followed by a specific string in PostgreSQL?

1 Answers1