Finding the correct regex for bolded/underlined strings (Python)

Question

So I have two sets of criteria that I would like to find in a string. For example:

import re
bold_pattern = re.compile() #pattern for finding all words in between ** **
underline_pattern = re.compile() # pattern for finding all words in between __ __
a = "__Hello__ **This** __is__ **Lego**"

How would I go about doing that with regex?

Start [learning capture groups](https://regexone.com/lesson/capturing_groups) — rdas, Apr 19 '20 at 04:53

Austin · Accepted Answer · 2020-04-19T05:04:31.917

5

Use capture groups to capture the text between the two delimiters:

bold_pattern = re.compile(r'\*\*(.*?)\*\*')   # pattern for finding all words in between ** **
underline_pattern = re.compile(r'__(.*?)__')  # pattern for finding all words in between __ __

Then use them in a re.findall:

bolds = re.findall(bold_pattern, a)
# or: bold_pattern.findall(a)
underlines = re.findall(underline_pattern, a)
# or: underline_pattern.findall(a)
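The `?` after `.*` is what keeps each match confined to one pair of delimiters. Here is a minimal sketch of the difference, using the string from the question; the name `greedy_bold` is only for illustration and is not part of the answer above:

```python
import re

a = "__Hello__ **This** __is__ **Lego**"

bold_pattern = re.compile(r'\*\*(.*?)\*\*')   # lazy: stops at the nearest closing **
underline_pattern = re.compile(r'__(.*?)__')  # lazy: stops at the nearest closing __
greedy_bold = re.compile(r'\*\*(.*)\*\*')     # greedy: runs to the last ** in the string

bolds = bold_pattern.findall(a)
underlines = underline_pattern.findall(a)
greedy = greedy_bold.findall(a)

print(bolds)       # ['This', 'Lego']
print(underlines)  # ['Hello', 'is']
print(greedy)      # ['This** __is__ **Lego']
```

With a single capture group in the pattern, `findall` returns only the group's text, which is why the delimiters themselves do not appear in the results.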

edited Apr 19 '20 at 05:04

answered Apr 19 '20 at 04:58

Austin



Thanks! Side note - since it's already being compiled I would just do bold_pattern.findall(a) wouldn't I? – Lego490 Apr 19 '20 at 05:02

score 1 · Answer 2 · answered Apr 19 '20 at 04:56


Using re.findall we can try:

a = "__Hello__ **This** __is__ **Lego**"
terms = re.findall(r'\*\*(.*?)\*\*', a)
print(terms)

This prints:

['This', 'Lego']

answered Apr 19 '20 at 04:56

Tim Biegeleisen


score 1 · Answer 3 · answered Apr 19 '20 at 05:18

Hope this helps :) First define the pattern with re.compile, then call the compiled pattern's findall method to extract the strings. You can also do it in one line by passing the pattern straight to re.findall, as @Tim Biegeleisen suggested.

import re
bold_pattern = re.compile(r'\*\*(.*?)\*\*')      # asterisks are metacharacters, so escape them
underline_pattern = re.compile(r'__(.*?)__')     # underscores need no escaping
a = "__Hello__ **This** __is__ **Lego**"
print(bold_pattern.findall(a))
print(underline_pattern.findall(a))

score 1 · Answer 4 · answered Apr 19 '20 at 09:38

Suggestion:

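If the bolded or underlined text itself spans more than one line (i.e. it contains \n), you'll need to pass flags=re.DOTALL to re.findall(). By default, . matches any character except a newline, so a match cannot cross a line break.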

Case: Multiline text

import re

# string to be searched
a = """
__Hello__ **This 
is a multiline test** __it is__ **Lego
**
"""

# pattern variations
bold_pattern = r'\*\*(.*?)\*\*'

# call re functions
match = re.findall(pattern=bold_pattern, string=a)
flag_match = re.findall(pattern=bold_pattern, string=a, flags=re.DOTALL)

# print results for observation
print(match)
print(flag_match) # using the flag

Prints:

[' __it is__ ']
['This \nis a multiline test', 'Lego\n']

From the Python 3.8.2 documentation:
"The expression’s behaviour can be modified by specifying a flags value."
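Since the question uses compiled patterns, note that the flag can also be given to re.compile(); the compiled pattern's findall() method does not take a flags argument. A small sketch, assuming a test string similar to the one above but written with explicit \n escapes:

```python
import re

a = "__Hello__ **This \nis a multiline test** __it is__ **Lego\n**\n"

# The flag is fixed at compile time and applies to every later call.
bold_pattern = re.compile(r'\*\*(.*?)\*\*', flags=re.DOTALL)
underline_pattern = re.compile(r'__(.*?)__', flags=re.DOTALL)

bolds = bold_pattern.findall(a)
underlines = underline_pattern.findall(a)

print(bolds)       # ['This \nis a multiline test', 'Lego\n']
print(underlines)  # ['Hello', 'it is']
```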

Dealing with (\n)

Depending on your needs, there are a few different ways you can deal with \n. If I need to, I'll run re.sub() over the entire text body first to remove every newline, and only then apply the patterns.
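A rough sketch of that approach, assuming it is acceptable to drop the newlines entirely (the variable name `flattened` is just for illustration):

```python
import re

a = "__Hello__ **This \nis a multiline test** __it is__ **Lego\n**\n"

# Remove every newline up front so the default patterns (no DOTALL) still match.
flattened = re.sub(r'\n', '', a)

bolds = re.findall(r'\*\*(.*?)\*\*', flattened)
underlines = re.findall(r'__(.*?)__', flattened)

print(bolds)       # ['This is a multiline test', 'Lego']
print(underlines)  # ['Hello', 'it is']
```

Deleting the newline glues the neighbouring words together when there is no space next to it; replacing it with ' ' instead of '' avoids that, at the cost of possible double spaces.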

To Compile or Not to Compile?

From the Python 3.8.2 documentation:
"Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form...
...but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."

and

"The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions."

So unless you're using a whole bunch of patterns, you shouldn't see a noticeable improvement from compiling.

If you're working in IPython or a Jupyter notebook, you can also use the %%time cell magic (or %timeit) to test both options and see if you notice an advantage locally!
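Outside a notebook, the standard-library timeit module can make the same comparison. This is only a sketch: the function names and the repetition count are arbitrary choices, and the timings will vary from machine to machine.

```python
import re
import timeit

a = "__Hello__ **This** __is__ **Lego**"
pattern_text = r'\*\*(.*?)\*\*'
compiled = re.compile(pattern_text)

def with_module_function():
    # Goes through re's internal pattern cache on every call.
    return re.findall(pattern_text, a)

def with_compiled_pattern():
    # Reuses the pattern object created once above.
    return compiled.findall(a)

runs = 100_000  # arbitrary repetition count
module_seconds = timeit.timeit(with_module_function, number=runs)
compiled_seconds = timeit.timeit(with_compiled_pattern, number=runs)

print(f"re.findall:       {module_seconds:.3f}s")
print(f"compiled.findall: {compiled_seconds:.3f}s")
```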

Good luck!
