
I have a list of keywords to search for. Most of them are case-insensitive, but a few are case-sensitive, such as IT or I.T. for information technology. Usually I join all the keywords with "|" and set the flag to re.I, but that causes trouble for the case-sensitive keywords. Is there an easy way around this, or do I have to run a separate search for the case-sensitive ones? Thank you!

 import re

 keywords = ["internal control", "IT"]  # ... and many more
 patterns = r"\b(" + "|".join(keywords) + r")\b"
 m = re.findall(patterns, text, flags=re.I)  # text is the string being searched
eyllanesc
Victor Wang
  • Among the _many more_ `keywords`, how can you tell which ones are case-sensitive and which ones aren't? Would all-caps mean the word is case-sensitive? – SanV Jun 19 '19 at 03:59
  • The abbreviations are case-sensitive (all caps). The rest are case-insensitive. – Victor Wang Jun 19 '19 at 04:05
  • I think you would need two separate searches, which you could implement from the same keywords list by checking whether a particular keyword is all upper-case or not. Is regex the only option for you or have you thought about some other substring search options? – SanV Jun 19 '19 at 04:20
  • I will go with two separate searches then. Regex seems to be more efficient when dealing with a few hundred keywords. – Victor Wang Jun 19 '19 at 04:27

2 Answers


You can use the (?-i:...) inline modifier group to turn off case-insensitive matching for just that group. Note that it only works on Python 3.6+:

import re

s = "Internal control, it IT it's, Keyword2"
keywords = ["internal control", "IT", "keyword2"]
pattern = '|'.join(r'((?-i:\b{}\b))'.format(re.escape(k)) if k.upper() == k else r'(\b{}\b)'.format(re.escape(k)) for k in keywords)
print(re.findall(pattern, s, flags=re.I))

Prints:

[('Internal control', '', ''), ('', 'IT', ''), ('', '', 'Keyword2')]

From Python 3.6 documentation:

(?imsx-imsx:...)

(Zero or more letters from the set 'i', 'm', 's', 'x', optionally followed by '-' followed by one or more letters from the same set.) The letters set or removes the corresponding flags: re.I (ignore case), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the part of the expression. (The flags are described in Module Contents.)
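
The same scoped-flag syntax also works the other way round: leave the global flag off and switch case-insensitivity on only for the group that needs it. The following is a minimal sketch of that variant, not part of the answer's original code; the sample string and keywords are reused from above, and it assumes Python 3.6+. Because the pattern has no capturing groups, findall returns plain strings instead of tuples.

```python
import re

text = "Internal control, it IT it's, Keyword2"

# No global re.I here: only the (?i:...) group ignores case,
# while the bare "IT" alternative stays case-sensitive.
pattern = re.compile(r"\b(?:(?i:internal control|keyword2)|IT)\b")

matches = pattern.findall(text)
print(matches)  # ['Internal control', 'IT', 'Keyword2']
```

The lowercase "it" and "it's" are skipped because the IT alternative sits outside the case-insensitive group.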

Andrej Kesely

(Posting this as an answer because it is too much text for a comment)
I still think two separate searches would be cleaner and simpler, so this may be academic: you could possibly use some combination of a conditional regex and optional mode modifiers, as indicated in the respective links.
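
For what it's worth, the two-search idea could look roughly like the sketch below. It assumes, as stated in the comments on the question, that all-caps keywords are the case-sensitive ones. The function name find_keywords and the sample data are hypothetical. It also swaps \b for (?<!\w) and (?!\w) lookarounds, which is my own choice rather than anything from the question: \b cannot match after the trailing dot of a keyword like "I.T." when a space follows.

```python
import re

def find_keywords(text, keywords):
    """Two passes: case-sensitive for all-caps keywords, case-insensitive for the rest."""
    def build(words, flags=0):
        if not words:
            return None
        # Longest keywords first, so a longer keyword wins over a shorter prefix.
        ordered = sorted(words, key=len, reverse=True)
        alternation = "|".join(re.escape(w) for w in ordered)
        # Lookarounds instead of \b, so keywords ending in "." still match.
        return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", flags)

    # Assumption: an all-caps keyword (e.g. "IT", "I.T.") is case-sensitive.
    sensitive = [k for k in keywords if k.isupper()]
    insensitive = [k for k in keywords if not k.isupper()]

    results = []
    for pat in (build(sensitive), build(insensitive, re.I)):
        if pat is not None:
            results.extend(pat.findall(text))
    return results

sample = "Internal control, it IT it's, I.T. Keyword2"
found = find_keywords(sample, ["internal control", "IT", "I.T.", "keyword2"])
print(found)  # ['IT', 'I.T.', 'Internal control', 'Keyword2']
```

Note that the results are grouped by pass (case-sensitive first), not ordered by position in the text; use finditer and sort on match.start() if the order matters.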

SanV