
Please point me to a post if one already exists for this question.

How might I efficiently add word-boundary syntax to a list of strings?

For instance, I want the words in badpositions below to match only as whole words, so I'd like to use re.search(r'\bword\b', text). (The raw-string prefix matters: in a plain Python string, '\b' is a backspace character, not a word boundary.)

How do I get the words in badpositions to take the form [r'\bPresident\b', r'\bProvost\b'], etc.?

text = ['said Duke University President Richard H. Brodhead. "Our faculty look forward']
badpositions = ['President', 'Provost', 'University President', 'Senior Vice President'] 
Jerry
user3314418
  • Use a loop and follow [this](http://stackoverflow.com/questions/6930982/variable-inside-python-regex). – tenub Feb 18 '14 at 18:15
  • You can do a search loop, but it's a much faster search when all the strings are joined into a single regex. Some engines set up a trie for the alternation. Example: `\b(?:President|Provost|University President)\b`. Typically, this just involves using a join and string concatenation to create the regex string. –  Feb 18 '14 at 19:11
  • It's customary to accept the response that answers your question! – Russia Must Remove Putin Feb 18 '14 at 19:27
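
The single-regex idea from the comment above could be sketched roughly as follows. This is only an illustration, not a benchmark of the speed claim: `text` is taken here as a plain string rather than the one-element list from the question, and the `re.escape` call and the longest-first sort are added precautions that the comment does not mention.

```python
import re

badpositions = ['President', 'Provost', 'University President', 'Senior Vice President']
text = 'said Duke University President Richard H. Brodhead. "Our faculty look forward'

# Longest phrases first: when two alternatives could start at the same
# position, the regex takes the first one listed.
alternatives = sorted(badpositions, key=len, reverse=True)

# re.escape guards against regex metacharacters in the words (added precaution).
combined = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b")

# Each entry is (matched phrase, starting index in text).
matches = [(m.group(0), m.start()) for m in combined.finditer(text)]
print(matches)
```

Note that a combined pattern reports non-overlapping matches only, so the 'President' inside 'University President' is not listed separately here, whereas searching for each word on its own would find both.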

1 Answer

import re

re_badpositions = [r"\b{word}\b".format(word=word) for word in badpositions]

# text is a one-element list in the question, so search its first string
indexes = {word: re.search(pattern, text[0]) for word, pattern in zip(badpositions, re_badpositions)}

If I understand you correctly, you want to find where each word matches in its entirety (that is, \bWORD\b) in your text. Note that re.search returns a match object, or None when nothing matches, so call .start() on the result to get the starting index. This is how I'd do it, though building the list first is an extra step; you could just as easily do:

indexes = {word: re.search(r"\b{word}\b".format(word=word), text[0]) for word in badpositions}

I find it a little more intelligible to build the list of regexes first and then search with each one, rather than plunking the regex construction into the search call itself. This is ENTIRELY personal preference, though.
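
To go from match objects to actual starting indices, something like the sketch below might do. The helper name `find_starts` is made up for illustration, and `re.escape` is an added precaution in case a word ever contains regex metacharacters; neither comes from the question.

```python
import re

badpositions = ['President', 'Provost', 'University President', 'Senior Vice President']
text = ['said Duke University President Richard H. Brodhead. "Our faculty look forward']

def find_starts(words, string):
    """Map each word to the index of its first whole-word match, or None."""
    starts = {}
    for word in words:
        # Raw string so \b is a word boundary, not a backspace character.
        match = re.search(r"\b{}\b".format(re.escape(word)), string)
        starts[word] = match.start() if match else None
    return starts

# text is a one-element list in the question, so pass its first string.
starts = find_starts(badpositions, text[0])
print(starts)
```

Words that never appear as whole words map to None instead of raising an error, which keeps the lookup safe to use on every entry in badpositions.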

Adam Smith
  • Thanks! For some reason, I thought .format() only worked in print statements, but I see now that it's a general tool for building strings dynamically. – user3314418 Feb 18 '14 at 18:26
  • I personally think that putting `word` in the curly braces is somewhat redundant. `r"\b{}\b".format(word)` seems just as readable IMO. Otherwise, great answer. (+1) –  Feb 18 '14 at 18:29
  • @iCodez again, more a matter of personal preference. I've learned to put everything explicitly within braces because often for testing I will do `logging.info("x is {x} and y is {y} and z is {z}".format(**locals()))` then change it to what I actually need afterwards. – Adam Smith Feb 18 '14 at 18:34
  • How do I apply the word boundary to a list of lists of values? – dondapati Feb 04 '20 at 14:44
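
If the last comment means a nested list of words, one possible reading is a nested comprehension that wraps every inner word. The `groups` data below is hypothetical and only illustrates the shape; `re.escape` is again an added precaution.

```python
import re

# Hypothetical nested input: each inner list is one group of words.
groups = [['President', 'Provost'], ['University President', 'Senior Vice President']]

# Wrap every word in word boundaries while keeping the nesting intact.
wrapped = [[r"\b{}\b".format(re.escape(word)) for word in group] for group in groups]
print(wrapped)
```

The result has the same nesting as the input, so `wrapped[i][j]` is the pattern for `groups[i][j]`.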