Match Acronym and their Meaning with Python Regex

Question

I am working on a Python function that uses regular expressions to find an acronym given in parentheses in a sentence, together with the meaning it stands for. For example, "The Department of State (DOS) is the United States federal executive department responsible for international relations of the United States."

What I have so far is:

import re

text = "The Department of State (DOS) is the United States federal executive department responsible for international relations of the United States."

pattern = re.compile(r"^(.*?)(?:\((.*)\))?$")
result = ''
for i in pattern.finditer(text):
    result += text

print(result)

The output returns the entire text sentence. I am new to using regex and am probably misunderstanding the structure. From what I understand, r will match the characters, the ^ asserts the position at the start of the string, .*? matches any character, *? matches between zero and unlimited times, the ? will match zero or one times, the \( \) will match the parentheses, and the $ asserts the position at the end. I apologize if I am greatly misunderstanding any of this. I appreciate any help with understanding it.

Thanks!

Your current pattern `r"^(.*?)(?:\((.*)\))?$"` will match the start of the line `^`, followed by anything `(.*?)`, followed by 0 or 1 instances of anything in parentheses `(?:\((.*)\))?` that is at the end of the line `$`. So you will only see something in parentheses if it is at the end of the line. - James, Nov 16 '16 at 02:03
Hi, this isn't exactly what you're looking for, but I made this in Python without regex to try to solve the same problem: https://gist.github.com/nmolivo/ed07ccc158e230b8e7fcaa3b04dbabc1 Hope it can help! - Natalie Olivo, Feb 20 '20 at 08:00

score 0 · Accepted Answer · answered Nov 16 '16 at 02:37

r will match the characters

'r' is a Python string prefix that makes the literal a raw string, so backslashes reach the regex engine unchanged. It is not part of the re syntax.
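As a small illustration of what the raw-string prefix changes (and what it does not), here is a sketch using only the standard library. The variable names are made up for the example.

```python
import re

# In a normal string literal, Python itself interprets backslash escapes;
# in a raw literal the backslash stays an ordinary character.
assert len("\n") == 1    # a single newline character
assert len(r"\n") == 2   # a backslash followed by the letter n

# Both spellings hand the regex engine the same two characters: \(
escaped_normal = "\\("
escaped_raw = r"\("
assert escaped_normal == escaped_raw

# Either way, the pattern matches a literal opening parenthesis.
match = re.search(escaped_raw, "State (DOS)")
print(match.start())  # 6
```

So the prefix only affects how Python reads the literal. The regex engine sees the same pattern either way.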

the ? will match zero or one times,

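The ? referred to here is part of (?: which opens a non-capturing group: its contents take part in the match but are not returned as a matched group. The ? right after that group's closing parenthesis is the quantifier that makes the whole group optional (zero or one times).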
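A quick way to see the difference between a capturing and a non-capturing group is to compare what groups() returns. This is only a sketch on a shortened sample string, not the sentence from the question.

```python
import re

sample = "State (DOS)"

# Capturing group: the parenthesized part is returned by groups().
capturing = re.search(r"(\w+) (\(\w+\))", sample)

# Non-capturing group: (?:...) still has to match, but is not returned.
non_capturing = re.search(r"(\w+) (?:\(\w+\))", sample)

print(capturing.groups())       # ('State', '(DOS)')
print(non_capturing.groups())   # ('State',)
print(non_capturing.group(0))   # State (DOS)
```

The overall match, group(0), is the same in both cases. Only the list of returned groups differs.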

$ asserts the position at the end

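It asserts the position at the end of the entire string, not just the end of the matched portion. With your pattern, that forces the lazy (.*?) to keep expanding until it covers the whole sentence, because the optional parenthesized group could only match if the closing parenthesis were the last character of the string.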
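To check this behavior concretely, the sketch below runs the pattern from the question against the original sentence and against a shortened sentence (an assumption added for the example) that ends with the closing parenthesis.

```python
import re

text = "The Department of State (DOS) is the United States federal executive department responsible for international relations of the United States."

original = re.compile(r"^(.*?)(?:\((.*)\))?$")

# Full sentence: the optional group cannot match, so group 1 takes everything.
m = original.search(text)
print(m.group(1) == text)  # True
print(m.group(2))          # None

# Shortened sentence ending in ")": now the optional group can match.
short = "The Department of State (DOS)"
m_short = original.search(short)
print(m_short.groups())    # ('The Department of State ', 'DOS')
```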

This pattern captures both the name and the abbreviation:

import re

# `text` is the sentence defined in the question.
pattern = re.compile(r"^(.*?)\((.*?)\)")
for m in pattern.finditer(text):
    name, abbrev = m.groups()
    print(name.strip(), abbrev)

score 0 · Answer 2 · answered Nov 16 '16 at 02:45

You can do something like this.

import re

text = "The Department of State (DOS) is the United States federal executive department responsible for international relations of the United States."

# Text between the first pair of parentheses, here "DOS".
acronym = re.search(r"(?<=\().*?(?=\))", text).group(0)

# One word per acronym letter, separated by single spaces:
# \bD\w* \bO\w* \bS\w*
regex = " ".join(r"\b" + re.escape(letter) + r"\w*" for letter in acronym)

# Ignore case so that the lowercase "of" can supply the "O".
meaning = re.search(regex, text, re.IGNORECASE).group(0)

print("Acronym '" + acronym + "' stands for '" + meaning + "'.")

The idea is to get the string inside the parentheses, then build a regex from it that searches for consecutive words beginning with the letters of the acronym.
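Since the question asks for a function, the idea above (take the letters inside the parentheses and look for words that start with them) could be wrapped up roughly like this. It is a sketch under a few assumptions: find_acronyms is a hypothetical name, an acronym is taken to be two or more capital letters in parentheses, and the meaning is assumed to sit directly before the parentheses with exactly one word per letter.

```python
import re

def find_acronyms(text):
    """Return (acronym, meaning) pairs for parenthesized acronyms in text.

    Hypothetical helper: assumes one word per acronym letter, placed
    immediately before the opening parenthesis.
    """
    pairs = []
    # Assumption: an acronym is 2+ capital letters inside parentheses.
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = m.group(1)
        # One word per letter, e.g. \bD\w*\s+\bO\w*\s+\bS\w* for "DOS".
        words = r"\s+".join(r"\b" + re.escape(ch) + r"\w*" for ch in acronym)
        # Only look at the text before "(" and anchor the words to its end.
        before = text[:m.start()]
        expansion = re.search(words + r"\s*$", before, re.IGNORECASE)
        if expansion:
            pairs.append((acronym, expansion.group(0).strip()))
    return pairs

sample = "The Department of State (DOS) works with the Department of Defense (DOD)."
result = find_acronyms(sample)
print(result)
# [('DOS', 'Department of State'), ('DOD', 'Department of Defense')]
```

Acronyms that skip small words, or whose meaning appears somewhere else in the sentence, will not be found by this version and would need a looser pattern.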
