How to match abbreviations with their meaning with regex?

Question

I'm looking for a regex pattern that matches the following string:

Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred.

My goal is to match the following:

Some example text (SET)
Energy system models (ESM)
specific optima (SCO)
computer systems (CUST)
outside (OUTS)

The important part is that it's not always exactly three words and their first letter. Sometimes the letters used for the abbreviation are merely contained in the preceding words. That's why I started looking into the positive lookbehind. However, it is constrained by length, which can be worked around by combining it with a positive lookahead. So far I couldn't come up with a robust solution though.
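The length constraint on lookbehinds mentioned above can be checked directly. The following is a minimal sketch, assuming Python's built-in `re` module (the engine used in the answers to this question); the helper name `compiles` and both sample patterns are made up for illustration.

```python
import re

def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lookbehind whose body always matches exactly 3 characters is accepted.
fixed_ok = compiles(r'(?<=abc)\(')

# A lookbehind whose body can match a varying number of characters is
# rejected by `re` ("look-behind requires fixed-width pattern").
variable_ok = compiles(r'(?<=\w+ )\(')

print(fixed_ok, variable_ok)
```

Because the words before an abbreviation vary in length, a plain lookbehind cannot cover them in `re`, which is why the answers capture the preceding words in a group instead.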

What I've tried so far:

(\b[\w -]+?)\((([A-Z])(?<=(?=.*?\3))(?:[A-Z]){1,4})\)

This works reasonably well, but the matches include too many words:

Some example text (SET)
Energy system models (ESM)
are used to find specific optima (SCO)
Some say Computer systems (CUST)
In the summer playing outside (OUTS)

I have also tried to use a reference to the first letter of the abbreviation at the start of the first group. That didn't work at all though.

Things I have looked at but didn't find useful:

Useful resources:

There is no logic to connect the uppercase chars between the parentheses to the words before them, right? — The fourth bird, Aug 28 '20 at 10:00
Try `[x.group() for x in re.finditer(r'\b([A-Z])\w*(?:\s+\w+)*?\s*\(\1[A-Z]*\)', text)]` ([regex demo](https://regex101.com/r/K0XxCt/1)) — Wiktor Stribiżew, Aug 28 '20 at 10:03
@Thefourthbird the logic is that it somehow abbreviates the word(s) beforehand, and therefore the uppercase chars have to be contained within them. — david, Aug 28 '20 at 10:06
Ah, it must be `[x.group() for x in re.finditer(r'\b([A-Z])\w*(?:\s+\w+)*?\s*\(\1[A-Z]*\)', text, re.I)]` ([**Python demo**](https://ideone.com/GEc6dg)). I am just not sure if checking only the first word's initial letter is fine with OP. **@david**, is it good enough, or do you think there must be a more complex logic? — Wiktor Stribiżew, Aug 28 '20 at 10:16
@WiktorStribiżew That doesn't seem quite right, because abbreviations should be all upper-case. Otherwise, there could be false positives from cases like: `Stupid example(s)`. — ekhumoro, Aug 28 '20 at 10:25
@ekhumoro That is why I say "good enough". In cases like this, it is not easy to grab all valid occurrences with a plain simple regex. — Wiktor Stribiżew, Aug 28 '20 at 10:27
@WiktorStribiżew I suppose the OP could work around that by adding an if-condition to the comprehension: `[x.group() for x in re.finditer(r'\b([A-Z])\w*(?:\s+\w+)*?\s*\((\1[A-Z]*)\)', s, re.I) if x.group(2).isupper()]`. — ekhumoro, Aug 28 '20 at 10:45
@ekhumoro Yes, even more can be done with additional code. — Wiktor Stribiżew, Aug 28 '20 at 10:46

Wiktor Stribiżew · Accepted Answer · 2020-09-01T13:06:38.513

I suggest using

import re

def contains_abbrev(abbrev, text):
    """Return True if abbrev is upper-case and its letters occur in text in order."""
    if not abbrev.isupper():
        return False
    text = text.lower()
    pos = 0
    for c in abbrev.lower():
        pos = text.find(c, pos)
        if pos == -1:
            return False
        pos += 1  # move past the matched char so a repeated letter needs its own occurrence
    return True

text = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred. Stupid example(s) Stupid example(S) Not stupid example (NSEMPLE), bad example (Bexle)"
abbrev_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
# Keep only the regex matches whose abbreviation (Group 3) is contained in the preceding words (Group 1)
print([x.group() for x in re.finditer(abbrev_rx, text, re.I) if contains_abbrev(x.group(3), x.group(1))])

See the Python demo.

The regex used is

(?i)\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)

See the regex demo. Details:

\b - word boundary
(([A-Z])\w*(?:\s+\w+)*?) - Group 1 (text): an ASCII letter captured into Group 2, then 0+ word chars followed with any 0 or more occurrences of 1+ whitespaces followed with 1+ word chars, as few as possible
\s* - 0+ whitespaces
\( - a ( char
(\2[A-Z]*) - Group 3 (abbrev): same value as in Group 2 and then 0 or more ASCII letters
\) - a ) char.
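To make the group breakdown above concrete, here is a small sketch that applies the same pattern to a single sentence and reads out the three groups. The `sample` string is a shortened piece of the question's text, and the variable names are chosen only for this illustration.

```python
import re

abbrev_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
sample = "Energy system models (ESM) are used to find specific optima (SCO)."

# re.I stands in for the inline (?i) flag
m = re.search(abbrev_rx, sample, re.I)

# Group 1: candidate words, Group 2: their first letter, Group 3: the abbreviation
groups = (m.group(1), m.group(2), m.group(3))
print(groups)  # ('Energy system models', 'E', 'ESM')
```

The backreference `\2` is what anchors the start of Group 1: the match can only begin at a word whose first letter equals the first letter inside the parentheses.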

Once there is a match, Group 3 is passed as abbrev and Group 1 is passed as text to the contains_abbrev(abbrev, text) function, which makes sure that abbrev is an uppercase string and that all chars in abbrev are present in text, in the same order.
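The "same order, all present" condition is a subsequence test. As an alternative sketch (not the answer's code), it could be written with a single iterator over the text; the name `is_subsequence` is hypothetical, and the upper-case check on the abbreviation is deliberately left out here.

```python
def is_subsequence(abbrev, text):
    """Return True if the letters of abbrev appear in text in the same order.

    Case is ignored. `c in remaining` advances the iterator up to and
    including the first occurrence of c, so each character of the text
    can be used at most once.
    """
    remaining = iter(text.lower())
    return all(c in remaining for c in abbrev.lower())

print(is_subsequence("CUST", "computer systems"))  # True
print(is_subsequence("SS", "some"))                # False: only one "s" available
```

Since every text character is consumed once it has been passed, an abbreviation with a repeated letter only passes when the text contains that letter often enough.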

the `text` in `contains_abbrev` should be converted to lower, and the abbreviation part of the regex should have fixed length of `{2,}`. With those changes it works as expected. — david, Aug 31 '20 at 11:36
@david No need to lowercase, `re.I` makes the pattern case insensitive. If you need to make sure there are at least 2 chars in the abbreviation, replace `\2[A-Z]*` with `\2[A-Z]+`. — Wiktor Stribiżew, Aug 31 '20 at 11:49
there is a need to lowercase, since the text used for comparison isn't necessarily lowercase. Therefore, your code fails so far for the first abbreviation. — david, Sep 01 '20 at 13:03
@david I see what you mean now. I added `text` lowercasing line. — Wiktor Stribiżew, Sep 01 '20 at 13:06
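The effect of the `\2[A-Z]+` variant suggested in the comments above can be seen in a short sketch. The `sample` string and the names `loose_rx` and `strict_rx` are made up for this comparison.

```python
import re

sample = "Stupid example(S) and computer systems (CUST)"

# Pattern from the answer: abbreviation of one or more letters
loose_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
# Variant from the comments: abbreviation of at least two letters
strict_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]+)\)'

loose = [m.group() for m in re.finditer(loose_rx, sample, re.I)]
strict = [m.group() for m in re.finditer(strict_rx, sample, re.I)]

print(loose)   # ['Stupid example(S)', 'computer systems (CUST)']
print(strict)  # ['computer systems (CUST)']
```

Requiring a second letter drops single-letter parentheticals such as a plural marker, at the cost of missing genuine one-letter abbreviations.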

score 0 · Answer 2 · answered Aug 29 '20 at 23:13

A regex alone won't be enough; it looks like you might need a Python script for this. The following should handle all your scenarios:

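import re

a = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred."
# Each item is a tuple such as ("(SET)", "SET")
b = re.findall(r"(\((.*?)\))", a)
a = a.replace(".", "")  # drop periods so that "(SCO)." becomes the token "(SCO)"
i = a.split(' ')
for c in b:
    cont = 0
    m = []
    s = i.index(c[0])     # position of the "(ABBR)" token in the word list
    l = len(c[1])         # look back at most one word per abbreviation letter
    al = max(0, s - l)    # clamp so a negative start does not wrap around
    for j in range(al, s + 1):
        # start collecting at the first word whose initial matches the abbreviation's
        if i[j][0].lower() == c[1][0].lower():
            cont = 1
        if cont == 1:
            m.append(i[j])
    print(' '.join(m))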

Output:

Some example text (SET)
Energy system models (ESM)
specific optima (SCO)
computer systems (CUST)
outside (OUTS)
