
I have the following code that I'd like to optimize:

if re.search(str(stringA), line) and re.search(str(stringB), line):
    .....
    .....

I tried:

stringAB = stringA + '.*' + stringB
if re.search(str(stringAB), line):
    .....
    .....

But the results I get are not reliable. I'm using "re.search" here because it seems to be the only way I can search for the exact regex patterns specified in stringA and stringB.

The logic behind this code is modeled after this egrep command example:

stringA=Success
stringB=mysqlDB01

egrep "${stringA}" /var/app/mydata | egrep "${stringB}"

If there's a better way to do this without re.search, please let me know.

RoyMWell
  • What type of object are `stringA` and `stringB`? Presumably they aren't actually strings because you're calling `str()` on them. – PM 2Ring Jul 15 '18 at 09:20
  • They are strings. I'm calling str() to ensure Python treats them as strings. By strings, I mean any pattern that a user may want to search for in a file. – RoyMWell Jul 15 '18 at 09:34
  • If `s` is already a string then Python already knows it's a string object. `str(s)` simply returns `s`. – PM 2Ring Jul 15 '18 at 09:37
  • Are you missing hits because `stringA` does not always come before `stringB`? (Which that attempt suggests.) By the way: `if x and y` should already be optimized as much as possible, so perhaps you are attempting premature optimization here. – Jongware Jul 15 '18 at 09:40
  • It's not possible to make your solution more efficient. It already does the bare minimum amount of work that's required to get the desired result. (Except for needlessly calling `str` on `stringA` and `stringB`.) – Aran-Fey Jul 15 '18 at 10:27
  • Related/Dupe: [Regular expression to find two strings anywhere in input](//stackoverflow.com/q/2219830) – Aran-Fey Jul 15 '18 at 10:32

1 Answer


One way to do this is to make a pattern that matches either word (using \b so we only match complete words), use re.findall to check the string for all matches, and then use set equality to ensure that both words have been matched.

import re

stringA = "spam"
stringB = "egg"

words = {stringA, stringB}

# Make a pattern that matches either word
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

data = [
    "this string has spam in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.findall(s)
    print(repr(s), found, set(found) == words)   

output

'this string has spam in it' ['spam'] False
'this string has egg in it' ['egg'] False
'this string has egg in it and another egg too' ['egg', 'egg'] False
'this string has both egg and spam in it' ['egg', 'spam'] True
"the word spams shouldn't match" [] False
"and eggs shouldn't match, either" [] False

A slightly more efficient way to do set(found) == words is to use words.issubset(found), since it skips the explicit conversion of found to a set.
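
As a quick sanity check, the sketch below runs both tests on the same strings and confirms they agree. The helper name `has_all_words` is made up for illustration and is not part of the code above.

```python
import re

words = {"spam", "egg"}
pat = re.compile(r"\bspam\b|\begg\b")

def has_all_words(s):
    # issubset accepts any iterable, so the list returned by findall
    # can be passed in directly without building a set first.
    return words.issubset(pat.findall(s))

samples = [
    "this string has both egg and spam in it",
    "this string has egg in it and another egg too",
]
for s in samples:
    found = pat.findall(s)
    # Both tests agree because found can only ever contain items from words.
    assert has_all_words(s) == (set(found) == words)
    print(repr(s), has_all_words(s))
```

The two tests are only interchangeable because the pattern can match nothing except the words themselves.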


As Jon Clements mentions in a comment, we can simplify and generalize the pattern to handle any number of words, and we should use re.escape, just in case any of the words contain regex metacharacters.

pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

Thanks, Jon!
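
A minimal sketch of how the generalized pattern might be wrapped up for an arbitrary collection of words. The names `make_any_order_pattern` and `contains_all` are hypothetical, chosen only for this example.

```python
import re

def make_any_order_pattern(words):
    # Escape each word so that regex metacharacters are matched literally.
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b({})\b".format(alternatives))

words = {"spam", "egg", "ham"}
pat = make_any_order_pattern(words)

def contains_all(s):
    # With a single capture group, findall returns the text of that group
    # for every match, i.e. the words that were found.
    return words.issubset(pat.findall(s))

print(contains_all("ham, egg and spam"))
print(contains_all("ham and eggs"))
```

Note that \b only behaves as a whole-word test when each word starts and ends with a word character (a letter, digit or underscore), so this sketch assumes that is the case.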


Here's a version that matches the words in the specified order. If it finds a match it prints the matching substring, otherwise it prints None.

import re

stringA = "spam"
stringB = "egg"
words = [stringA, stringB]

# Make a pattern that matches all the words, in order
pat = r"\b.*?\b".join([re.escape(word) for word in words])
pat = re.compile(r"\b" + pat + r"\b")

data = [
    "this string has spam and also egg, in the proper order",
    "this string has spam in it",
    "this string has spamegg in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.search(s)
    if found:
        found = found.group()
    print('{!r}: {!r}'.format(s, found))

output

'this string has spam and also egg, in the proper order': 'spam and also egg'
'this string has spam in it': None
'this string has spamegg in it': None
'this string has egg in it': None
'this string has egg in it and another egg too': None
'this string has both egg and spam in it': None
"the word spams shouldn't match": None
"and eggs shouldn't match, either": None
Aran-Fey
PM 2Ring
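
For the line-by-line log search the asker describes in the comments, the ordered pattern can be compiled once and reused for every line. The following is only a sketch under that assumption: the function names and the sample log lines are invented, and `io.StringIO` stands in for the real `with open(logfile) as f:`.

```python
import io
import re

def make_ordered_pattern(words):
    # Same construction as the ordered version in the answer:
    # whole words, in the given order, with anything in between.
    inner = r"\b.*?\b".join(re.escape(word) for word in words)
    return re.compile(r"\b" + inner + r"\b")

def matching_lines(lines, words):
    pat = make_ordered_pattern(words)  # compile once, outside the loop
    for line in lines:
        if pat.search(line):
            yield line.rstrip("\n")

# Made-up log contents, standing in for a real file object.
log = io.StringIO(
    "10:01 Success connecting to mysqlDB01\n"
    "10:02 Failure connecting to mysqlDB01\n"
    "10:03 mysqlDB01 reported Success\n"
)
hits = list(matching_lines(log, ["Success", "mysqlDB01"]))
print(hits)
```

To search in the reversed order, pass the words the other way round.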
  • Might be worthwhile generalising `pat` so it's something like: `r'\b{}\b'.format('|'.join(re.escape(word) for word in words))` ? – Jon Clements Jul 15 '18 at 09:50
  • Although it doesn't matter here - you could possibly make use of `.finditer` to avoid creation of a list... eg: `words.issubset(m.group() for m in pat.finditer(s))` – Jon Clements Jul 15 '18 at 09:54
  • @JonClements Good thinking! I didn't use `re.escape` originally, since I figured that the strings might already be regexes, but I guess it is a Good Idea. But I won't bother with `.finditer`, since there's probably not much benefit if the OP is searching single lines of text. – PM 2Ring Jul 15 '18 at 10:06
  • I'm actually using `with open(logfile) as f` to iterate through a huge log file and search for the two patterns on each line that is read. The strings must appear in the order specified: `stringA then stringB`. Although, I can imagine a scenario where a user would want it reversed. So I wonder if `.finditer` could help speed up the process of reading a huge log file and checking each line for the two patterns? – RoyMWell Jul 15 '18 at 10:15
  • @RoyMWell Ah. You should mention that the matches have to occur in the specified order in your question. The hardest part of this task was matching the words in any order. :) – PM 2Ring Jul 15 '18 at 10:18
  • @RoyMWell Please see the updated version at the end of my answer. Since you need to search line by line `.finditer` isn't much benefit here: it's useful when each string to be searched is many kilobytes and contains lots of matches. – PM 2Ring Jul 15 '18 at 10:51
  • generalizing with ored regexes works to some extent. If you have 1000+ words, it amounts to a slooow linear search, in that case it's better to match all words and then compare the words from a `set` or `dict` – Jean-François Fabre Jul 15 '18 at 11:41
  • @Jean-FrançoisFabre Makes sense. When you say "match all words", do you mean using a simple pattern like `r"\b\w+\b"`? – PM 2Ring Jul 15 '18 at 11:50
  • @Jean-FrançoisFabre not necessarily... I'm not 100% sure if Python does it but generally for a simple regex like here - the regex engine can create a prefix tree and very quickly determine non-matches and possible matches... – Jon Clements Jul 15 '18 at 11:54
  • @JonClements we created a code obfuscator at work, and using the OR technique with 2000+ words took hours, because it's linear. There's no way regex is smart enough to work letter by letter tree-like. Maybe `regex` module can do it. Dunno. – Jean-François Fabre Jul 15 '18 at 13:37
  • @Jean-FrançoisFabre you seem to be getting different mileage than something like https://stackoverflow.com/a/50285809 then - see the comment from OP. – Jon Clements Jul 15 '18 at 14:00
  • @JonClements yeah, good but not the same case (it matches sentences). My solution (can't find it right now, maybe I just did that for my work, what a waste :)) my solution wouldn't work with spaces in the patterns anyway. – Jean-François Fabre Jul 15 '18 at 14:17