I am trying to search through an Apache log file and pull out lines that do not contain certain strings ("session" and "curl") but do contain a particular month string ("Dec"). Each search works on its own:

re.search("^((?!session|curl).)*$", f[line])
re.search(r'Dec', f[line])

Can I combine them in a single join? I tried this:

re.search('|'.join('(?:{0})'.format(x) for x in (r'Dec', r'/^((?!session|curl).)*/$')), f[line])

I expect to see only lines with the correct month, with any lines containing "session" or "curl" excluded, but instead all the lines are returned.

What am I doing wrong?

Unpossible

1 Answer

Yes, it is possible. You need to construct a regex like

^(?!.*(?:session|curl)).*Dec

See the regex demo. Details:

  • ^ - start of string
  • (?!.*(?:session|curl)) - no session or curl should appear on the line (if you add a DOTALL modifier, the whole string will be considered)
  • .*Dec - any 0+ chars (other than line break chars if the DOTALL modifier is not used), as many as possible, up to the last occurrence of a Dec substring.

Add word boundaries (\b) around the group/word if whole word match is required.
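As a minimal sketch of the word-boundary variant mentioned above: the helper name `keeps` and the sample strings in the usage note are hypothetical and only illustrate the idea; they are not part of the original answer.

```python
import re

words = ['session', 'curl']
month = 'Dec'

# Escape each word so any regex metacharacters are treated literally
alternation = '|'.join(re.escape(w) for w in words)

# \b on both sides of the group and of the month, so only whole words count
pattern = re.compile(
    r'^(?!.*\b(?:{})\b).*\b{}\b'.format(alternation, re.escape(month))
)

def keeps(line):
    """Return True if the line has the month and none of the excluded words."""
    return pattern.search(line) is not None
```

With these boundaries, a line containing `sessions` is no longer rejected (it is not the whole word `session`), and `December` no longer counts as `Dec`. Whether that is desirable depends on the log data.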

Sample Python demo:

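import re

words = ['session', 'curl']  # substrings that must not appear on the line
month = 'Dec'                # substring that must appear on the line
# Escape each word so regex metacharacters are matched literally
x = '|'.join(re.escape(w) for w in words)
pattern = r'^(?!.*(?:{})).*{}'.format(x, re.escape(month))
m = re.search(pattern, 'Date: Dec 2016')
if m:
    print('Matched')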
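The same pattern can be applied to several lines at once. The sketch below assumes a hypothetical helper `build_filter` and four made-up log lines in Apache combined format. For contrast, it also builds an alternation in the style of the question's join, with the stray `/` characters dropped from the second pattern. Joining with `|` means "either pattern matches", which is why that approach does not filter anything here.

```python
import re

def build_filter(excluded, required):
    """Compile a pattern that needs `required` and rejects any of `excluded`."""
    alternation = '|'.join(re.escape(w) for w in excluded)
    return re.compile(
        r'^(?!.*(?:{})).*{}'.format(alternation, re.escape(required))
    )

# Made-up sample lines, for illustration only
sample_lines = [
    '10.0.0.1 - - [16/Dec/2016:08:40:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [16/Dec/2016:08:41:00 +0000] "GET /session/new HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [16/Dec/2016:08:42:00 +0000] "GET / HTTP/1.1" 200 64 "-" "curl/7.50.1"',
    '10.0.0.4 - - [16/Nov/2016:08:43:00 +0000] "GET /about.html HTTP/1.1" 200 256 "-" "Mozilla/5.0"',
]

# Lookahead-based filter: both conditions must hold on the same line
pattern = build_filter(['session', 'curl'], 'Dec')
kept = [line for line in sample_lines if pattern.search(line)]

# "|" alternation: a line matches if EITHER sub-pattern matches
either = re.compile(
    '|'.join('(?:{0})'.format(p) for p in (r'Dec', r'^((?!session|curl).)*$'))
)
or_matches = [line for line in sample_lines if either.search(line)]
```

On this sample, `kept` holds only the first line, while `or_matches` holds all four: the three December lines match `Dec`, and the November line matches the "no session or curl" alternative.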
Wiktor Stribiżew
  • Thank you! This worked, I put my data in the demo and it came out great! – Unpossible Dec 16 '16 at 08:40
  • Just FYI: `^((?!session|curl).)*$` is a very resource-consuming construct (a [tempered greedy token](http://stackoverflow.com/a/37343088/3832970)). Avoid it if possible; use a simple lookahead when you need to match a string that does not contain some other string. – Wiktor Stribiżew Dec 16 '16 at 08:51
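To illustrate the comment above, here is a small sketch that checks the tempered greedy token and a single anchored lookahead against a few hypothetical strings. It compares the results only and makes no claim about timing.

```python
import re

# Tempered greedy token: runs the lookahead before every character
tempered = re.compile(r'^((?!session|curl).)*$')

# Single lookahead at the start of the string
lookahead = re.compile(r'^(?!.*(?:session|curl))')

# Hypothetical single-line samples (no embedded newlines)
samples = ['GET /index.html', 'GET /session/new', 'curl/7.50.1', '']

# Each tuple is (tempered result, lookahead result) for one sample
results = [
    (bool(tempered.search(s)), bool(lookahead.search(s))) for s in samples
]
```

Both patterns accept and reject the same samples here. The lookahead form reaches that decision with one check at the start of the string rather than one check per character.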