re.sub a list of words, ignore case

Question

I am trying to wrap each word from a list in the HTML `<b>` element wherever it occurs in a sentence. After some searching I got it almost working, except for ignoring case.

import re

bolds = ['test', 'tested']  # I want to bold these words, ignoring-case
text = "Test lorem tested ipsum dolor sit amet test, consectetur TEST adipiscing elit test."

pattern = r'\b(?:' + "|".join(bolds) + r')\b'
dict_repl = {k: f'<b>{k}</b>' for k in bolds}
text_bolded = re.sub(pattern, lambda m: dict_repl.get(m.group(), m.group()), text)
print(text_bolded)

Output:

Test lorem <b>tested</b> ipsum dolor sit amet <b>test</b>, consectetur TEST adipiscing elit <b>test</b>.

This output misses the `<b>` element for Test and TEST. In other words, I would like the output to be:

<b>Test</b> lorem <b>tested</b> ipsum dolor sit amet <b>test</b>, consectetur <b>TEST</b> adipiscing elit <b>test</b>.

One hack is that I explicitly add the capitalize and upper, like so ...

bolds = bolds + [b.capitalize() for b in bolds] + [b.upper() for b in bolds]

But I am thinking there must be a better way to do this. Besides, the above hack will miss words like tesT, etc.

Thank you!

I think you can probably just add `re.I` as the last parameter in your `re.sub` function — Alexander, Mar 07 '23 at 00:09
@Alexander Actually I had tried `re.I` and `re.IGNORECASE`, but unfortunately they do not give the desired result. In fact, doing that makes it worse (it misses the last `test` in `... elit test.`)! I am not too savvy with `re` and have tried a lot of things, but no luck :( — tikka, Mar 07 '23 at 00:17
This is a common mistake. You need `flags=re.I`. If you don't use `flags=`, you're setting the next positional argument, which is the max number of replacements. — Barmar, Mar 07 '23 at 00:29
@Barmar great point, but unfortunately even that did not work. Just to clarify I did this `text_bolded = re.sub(pattern, lambda m: dict_repl.get(m.group(), m.group()), text, flags=re.I)` and it still did not give the desired result. Wondering if I am still missing something? — tikka, Mar 07 '23 at 00:38
@Barmar The issue with this is that it also makes the original text lowercase, which is not ideal. Although, I can do this and then explicitly get the placement indices of the `<b>` element and copy it over to the original sentence (to maintain the original casing). — tikka, Mar 07 '23 at 00:47
Why are you using a dictionary? You're replacing everything with the same thing. — Barmar, Mar 07 '23 at 00:49
@Barmar I am just adding the HTML `<b>` element, like `k: f'<b>{k}</b>'`, to basically bold those words. — tikka, Mar 07 '23 at 00:51
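
The two failures described in the comments above can be reproduced with the question's own data. This is only an illustrative sketch; the variable names `positional` and `keyword` are made up for the example.

```python
import re

bolds = ['test', 'tested']
text = "Test lorem tested ipsum dolor sit amet test, consectetur TEST adipiscing elit test."
pattern = r'\b(?:' + "|".join(bolds) + r')\b'
dict_repl = {k: f'<b>{k}</b>' for k in bolds}

def repl(m):
    # Look up the exact matched text; fall back to the unchanged match.
    return dict_repl.get(m.group(), m.group())

# Pitfall 1: the fourth positional parameter of re.sub is `count`, not `flags`.
# re.I has the integer value 2, so this call stays case-sensitive and stops
# after two replacements, which is why the last "test" is missed.
positional = re.sub(pattern, repl, text, re.I)

# Pitfall 2: with flags=re.I the pattern does match "Test" and "TEST", but the
# dictionary only has lowercase keys, so those matches are returned unchanged.
keyword = re.sub(pattern, repl, text, flags=re.I)

print(positional)
print(keyword)
```

So the flag itself works once it is passed by keyword; what still drops `Test` and `TEST` is the case-sensitive dictionary lookup in the replacement function.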

Barmar · Accepted Answer · 2023-03-07T01:08:51.243

2

There's no need for the dictionary or the function. Every replacement is just a fixed string wrapped around the matched text, and you can get the matched text with a back-reference.

Use flags=re.I to make the match case-insensitive.

text_bolded = re.sub(pattern, r'<b>\g<0></b>', text, flags=re.I)

\g<0> is a back-reference that returns the full match of the pattern.
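
Putting the pieces together with the question's data, a complete run might look like the sketch below. The helper name `bold_words` is hypothetical, and the `re.escape` call is an extra precaution that is not part of the answer above; it only matters if a word in the list contains regex metacharacters.

```python
import re

def bold_words(text, words):
    """Wrap every case-insensitive whole-word match of `words` in <b>...</b>."""
    # re.escape is an added assumption: it keeps characters such as "." or "+"
    # in a word from being interpreted as regex syntax.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = r'\b(?:' + alternatives + r')\b'
    # \g<0> inserts the text exactly as matched, so the original casing survives.
    return re.sub(pattern, r'<b>\g<0></b>', text, flags=re.I)

bolds = ['test', 'tested']
text = "Test lorem tested ipsum dolor sit amet test, consectetur TEST adipiscing elit test."

text_bolded = bold_words(text, bolds)
print(text_bolded)
# <b>Test</b> lorem <b>tested</b> ipsum dolor sit amet <b>test</b>, consectetur <b>TEST</b> adipiscing elit <b>test</b>.
```

Because the replacement reuses the matched text rather than the list entry, mixed-case variants such as `tesT` are wrapped as well, without being lowercased.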

edited Mar 07 '23 at 01:08

answered Mar 07 '23 at 00:53

Barmar
