There is only one way of knowing how many words prior to (CFA) constitute the so-called full form: count the alphas in group 2 (assign the count to l), split group 1 on whitespace, take the last l words, and then rejoin them.
Your regex would accept (CFA.) but not (C.F.A.), so a slight modification is in order to allow an optional period after each alpha. It also appears you are trying to say that the abbreviation must consist of two or more alpha characters; there is an easier way to express that.
The change to group 2 in the regex:
( # start of group 2
(?: # start of non-capturing group
[A-Z] # an alpha character
\.? # optionally followed by a period
) # end of non-capturing group
{2,} # the non-capturing group is repeated 2 or more times
) # end of group 2
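As a quick check of the group 2 change on its own, here is a minimal sketch that compiles the commented pattern with re.VERBOSE and tests a few candidates. The names ABBREVIATION and is_abbreviation are hypothetical helpers for illustration, not part of the original regex.

```python
import re

# Group 2 in isolation, compiled with re.VERBOSE so the commented layout
# can be used directly. ABBREVIATION is a hypothetical name.
ABBREVIATION = re.compile(r"""
    (               # start of the capturing group (group 2 in the full regex)
      (?:           # start of non-capturing group
        [A-Z]       # an alpha character
        \.?         # optionally followed by a period
      ){2,}         # end of non-capturing group, repeated 2 or more times
    )               # end of the capturing group
""", re.VERBOSE)


def is_abbreviation(text):
    """Return True if the whole string is an abbreviation per the pattern."""
    return ABBREVIATION.fullmatch(text) is not None


for candidate in ["CFA", "C.F.A.", "CFA.", "C", "C.", "cfa"]:
    print(candidate, is_abbreviation(candidate))
```

The first three candidates are accepted; the single-letter and lowercase ones are rejected because the group must repeat at least twice and only matches uppercase alphas.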
The code:
#!/usr/bin/env python3
import re
x = 'Confirmatory factor analysis (CFA) is a special case of what is known as structural equation modelling (S.E.M.).'
results = []
split_regex = re.compile(r'\s+')
for m in re.finditer(r'\b([A-Za-z][a-z]*(?:\s[A-Za-z][a-z]*)+)\s+\(((?:[A-Z]\.?){2,})\)', x):
    abbreviation = m[2]
    l = sum(c.isalpha() for c in abbreviation)
    full_form = ' '.join(split_regex.split(m[1])[-l:])
    results.append([full_form, abbreviation])
print(results)
Prints:
[['Confirmatory factor analysis', 'CFA'], ['structural equation modelling', 'S.E.M.']]
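Taking the last l words assumes one word per letter of the abbreviation, which will not hold for something like "analysis of variance (ANOVA)". A possible sketch, assuming you want to flag such cases rather than silently accept them, wraps the same logic in a function and adds an initials check. The names extract_abbreviations and initials_match are hypothetical and not part of the code above.

```python
import re

PATTERN = re.compile(
    r'\b([A-Za-z][a-z]*(?:\s[A-Za-z][a-z]*)+)\s+\(((?:[A-Z]\.?){2,})\)'
)


def initials_match(full_form, abbreviation):
    """Return True if the first letters of full_form spell the abbreviation."""
    letters = [c for c in abbreviation if c.isalpha()]
    words = full_form.split()
    return len(words) == len(letters) and all(
        word[0].upper() == letter for word, letter in zip(words, letters)
    )


def extract_abbreviations(text):
    """Return (full_form, abbreviation, initials_ok) tuples found in text."""
    results = []
    for m in PATTERN.finditer(text):
        abbreviation = m[2]
        # Number of alphas in the abbreviation = number of words to keep.
        l = sum(c.isalpha() for c in abbreviation)
        full_form = ' '.join(m[1].split()[-l:])
        results.append((full_form, abbreviation,
                        initials_match(full_form, abbreviation)))
    return results


print(extract_abbreviations('We ran an analysis of variance (ANOVA) today.'))
```

The third element of each tuple is False when the extracted words do not line up with the letters, so the caller can decide how to handle those matches.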