regex for catching abbreviations

Question

I am trying to make a regex that matches abbreviations and their full forms in a string. I have a regex that catches some cases but on the example below, it catches more words than it should. Could anyone please help me fix this?

import re

x = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (SEM).'

re.findall(r'\b([A-Za-z][a-z]+(?:\s[A-Za-z][a-z]+)+)\s+\(([A-Z][A-Z]*[A-Z]\b\.?)', x)

out:

[('Confirmatory factor analysis', 'CFA'),
 ('special case of what is known as structural equation modelling', 'SEM')]

What is it supposed to match instead? What are your intended criteria for matching an abbreviation? – Barmar, Mar 12 '20 at 16:46
I think there are no rules for associating an acronym with its original words. For example, "light amplification by stimulated emission of radiation" and **LASER** differ in length. You should decide how to associate acronyms with their original words. – Boseong Choi, Mar 12 '20 at 16:52
It has to catch every word that the acronym represents. In my example the first match is correct, while the second has caught too many words. It was supposed to match only "structural equation modelling". I'm quite new to regex. – hippocampus, Mar 12 '20 at 16:53
Your acronyms can easily be captured with `(?<=\()[A-Z]+(?=\))`, so once you do that you just need to translate them. – MonkeyZeus, Mar 12 '20 at 16:56
SEM could be an acronym of "**S**pecial case of what is known as structural **E**quation **M**odeling". An acronym depends on its definition. I think the problem is how to figure out the original words, before any regex. – Boseong Choi, Mar 12 '20 at 16:56
1. There is no way of knowing how many words prior to `(CFA)` constitute the so-called full form. You could look at the number of alphas in group 2, split group 1 on whitespace, take the last n words based on the length of group 2 and then rejoin. 2. Your regex would accept `(CFA.)` but not `(C.F.A.)`. – Booboo, Mar 12 '20 at 16:57
What about `Check for a match with Confirmatory factor analysis (CFA).`? – Cary Swoveland, Mar 12 '20 at 17:44
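
The lookaround pattern from MonkeyZeus's comment can be tried on the question's string. This is only a minimal sketch: the variable names are illustrative, and it extracts the parenthesized acronyms without attempting to pair them with a full form.

```python
import re

text = ('Confirmatory factor analysis (CFA)  is a special case of what is '
        'known as structural equation modelling (SEM).')

# (?<=\() requires a "(" just before the match and (?=\)) a ")" just after;
# neither parenthesis becomes part of the matched text.
acronyms = re.findall(r'(?<=\()[A-Z]+(?=\))', text)
print(acronyms)
```

Finding the acronyms is the easy half; deciding which preceding words belong to each one still needs a rule, as the other comments point out.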

Booboo · Answer 1 · 2020-03-13T19:43:39.650

The regex alone cannot tell how many words before (CFA) make up the so-called full form. One workable heuristic: count the alphabetic characters in group 2 (call the count l), split group 1 on whitespace, keep the last l words, and rejoin them.
Your regex would accept (CFA.) but not (C.F.A.), so a slight modification is in order to allow an optional period after each letter. You also appear to be requiring that the abbreviation consist of two or more letters; there is an easier way to express that.

Change to Group 2 in the regex:

(                    # start of group 2
  (?:                # start of non-capturing group
     [A-Z]           # an uppercase letter
     \.?             # optionally followed by a period
  )                  # end of non-capturing group
  {2,}               # the non-capturing group is repeated 2 or more times
)                    # end of group 2
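
As a quick check of this group in isolation, here is a minimal sketch using `re.VERBOSE`. The name `ABBREVIATION` and the candidate strings are illustrative assumptions, not part of the answer.

```python
import re

# Hypothetical name; the pattern body is the group described above.
ABBREVIATION = re.compile(r'''
    (                # start of the capturing group
      (?:            # start of non-capturing group
        [A-Z]        # an uppercase letter
        \.?          # optionally followed by a period
      ){2,}          # non-capturing group repeated 2 or more times
    )                # end of the capturing group
''', re.VERBOSE)

# fullmatch() succeeds only if the whole candidate is an abbreviation.
for candidate in ['CFA', 'C.F.A.', 'CFA.', 'C', 'C.']:
    print(candidate, bool(ABBREVIATION.fullmatch(candidate)))
```

The first three candidates should be accepted and the single-letter ones rejected, since the group has to repeat at least twice.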

The code:

#!/usr/bin/env python3

import re

x = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (S.E.M.).'
results = []
split_regex = re.compile(r'\s+')
for m in re.finditer(r'\b([A-Za-z][a-z]*(?:\s[A-Za-z][a-z]*)+)\s+\(((?:[A-Z]\.?){2,})\)', x):
    abbreviation = m[2]
    l = sum(c.isalpha() for c in abbreviation)
    full_form = ' '.join(split_regex.split(m[1])[-l:])
    results.append([full_form, abbreviation])
print(results)

Prints

[['Confirmatory factor analysis', 'CFA'], ['structural equation modelling', 'S.E.M.']]

Thank you very much. How would you change the regex for cases where the acronym is not in parentheses and the long form does not come before it? For example, if the string is something like ' LOL is an acronym meaning laughing out loud'. – hippocampus, Mar 13 '20 at 18:44
That looks fairly difficult unless `laughing out loud`, i.e. the "full form", is guaranteed not to be followed by any other words. – Booboo, Mar 13 '20 at 19:30
I actually just updated the regex, having realized that it did not recognize single-letter words. – Booboo, Mar 13 '20 at 19:44
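
For the follow-up case in the comments, a rough sketch is possible under the strong assumption Booboo mentions: the acronym opens the sentence and the full form is the very last thing in it. The helper name `trailing_full_form` is hypothetical, and the initials check is an extra safeguard that nobody in the thread asked for.

```python
import re

def trailing_full_form(sentence):
    """Pair a leading all-caps acronym with the last n words of the sentence.

    Assumes the full form ends the sentence; returns None otherwise.
    """
    m = re.match(r'\s*([A-Z]{2,})\b', sentence)
    if m is None:
        return None
    acronym = m.group(1)
    # Words that follow the acronym, ignoring punctuation.
    words = re.findall(r'[A-Za-z]+', sentence[m.end():])
    candidate = words[-len(acronym):]
    # Accept the candidate only if its initials spell the acronym.
    initials = ''.join(w[0].upper() for w in candidate)
    if initials != acronym:
        return None
    return (acronym, ' '.join(candidate))

print(trailing_full_form(' LOL is an acronym meaning laughing out loud'))
```

If any word follows the full form, the last n words no longer line up with the acronym and the helper returns `None`, which is the difficulty described above.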

score 0 · Answer 2 · answered Mar 12 '20 at 16:57

Try this. It looks for an uppercase string enclosed in parentheses, then validates that the initials of the preceding words match the abbreviation.


import re

string = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (SEM).'

abbrvs = re.findall(r"\(([A-Z][A-Z]+)\)", string)  # find potential abbrvs

# split on runs of whitespace, periods and commas; drop empty tokens
words = [w for w in re.split(r"[\s.,]+", string) if w]

validated_abbrvs = []
for abbrv in abbrvs:
    end = words.index(f"({abbrv})")
    start = max(end - len(abbrv), 0)  # avoid a negative slice start
    full_name = words[start:end]  # locate preceding words
    if "".join(w[0].upper() for w in full_name) == abbrv:  # validate it matches abbrv
        validated_abbrvs.append((abbrv, " ".join(full_name)))

print(validated_abbrvs)

answered Mar 12 '20 at 16:57 · Peter

What if the abbreviation were `(CFA.)`, which is allowed by the OP's regex? Your solution would fail to find all the abbreviations. – Booboo Mar 12 '20 at 17:59
Maybe I'm misunderstanding. Yes, his regex may catch it, but is that the intention? That seems like an invalid format. – Peter Mar 12 '20 at 18:02
One never knows. I suggested a change to the regex to allow `(C.F.A.)`, which seems reasonable. But in that case you have to be careful and recognize that the *effective* length is 3 and not 6, i.e. you need to count the actual *alpha* characters. Take a look at my solution. – Booboo Mar 12 '20 at 18:07
Depends on the format. There are a thousand ways something could be abbreviated. Technically, the abbreviation could be COFAAN. You need rules. Punctuation could also be ignored since it is irrelevant to the search. – Peter Mar 12 '20 at 18:09
And by ignoring punctuation I mean, when doing your search, `string = string.replace(".","")`, if that's a requirement. – Peter Mar 12 '20 at 18:10
If the periods are a required part of the abbreviation, I wouldn't be deleting them. I just think the OP incorrectly specified the optional use of periods. Some acronyms use them and some don't, but it is most *unusual* just to have one single period at the end. – Booboo Mar 12 '20 at 18:12
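
The point about *effective* length can be made concrete with a small sketch. The helper names `effective_length` and `initials_match` are hypothetical and are not taken from either answer.

```python
def effective_length(abbreviation):
    """Count only the letters, so 'C.F.A.' counts as 3 rather than 6."""
    return sum(c.isalpha() for c in abbreviation)

def initials_match(full_form, abbreviation):
    """Check that the initials of full_form spell the abbreviation's letters."""
    letters = ''.join(c for c in abbreviation if c.isalpha())
    initials = ''.join(word[0].upper() for word in full_form.split())
    return initials == letters

print(len('C.F.A.'), effective_length('C.F.A.'))
print(initials_match('structural equation modelling', 'S.E.M.'))
```

Filtering the letters out of the abbreviation leaves the original text untouched, so periods that are genuinely part of the abbreviation are preserved in the output.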

jose_bacoy · Answer 3 · 2020-03-12T18:48:16.387

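I used a regular expression to split the string on ( or ), then built a list of (preceding text, abbreviation) tuples from consecutive pairs of the pieces. For each pair, the last len(abbr) words of the preceding text are taken as the full form.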

import re
x = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (SEM).'
# splitting on parentheses alternates between plain text and abbreviations
lst = re.split(r'\(|\)', x)
lst = [(lst[i*2].strip(), lst[i*2+1].strip()) for i in range(len(lst)//2)]
final = []
for text, abbr in lst:
    # keep only the last len(abbr) words before the abbreviation
    full_form = ' '.join(text.split()[-len(abbr):])
    final.append((abbr, full_form))
print(final)

Result:

 [('CFA', 'Confirmatory factor analysis'),
 ('SEM', 'structural equation modelling')]

answered Mar 12 '20 at 16:59 · edited Mar 12 '20 at 18:48 · jose_bacoy

`SEM` is *not* an abbreviation of `is a special case of what is known as structural equation modelling` but rather `structural equation modelling`. – Booboo Mar 12 '20 at 18:18
