
I'm looking at the tutorial given here :-

https://docs.python.org/2/howto/regex.html#lookahead-assertions

I want to exclude files that end in .pqr.gz and I'm not quite sure how to do that.

e.g., the expected behaviour is :-

f1.gz => succeed
f1.abc.pqr => succeed
f1.pqr.gz => fail
f1.abc.gz => succeed

The best regex I could come up with was :-

r'.*[.](?=[^.]*[.][^.]*)(?!pqr[.]gz$)[^.]*[.][^.]*$'

This excludes files that end in .pqr.gz but doesn't for example allow files that are just f1.gz (i.e. first case I wrote above).

Any ideas on how this can be improved?

EDIT: There are better ways to do this (e.g., using str.endswith), but I'm curious about how to do this with a regex purely as an exercise.

martineau
owagh

2 Answers


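Well, to be honest, using a regex seems like overkill to me. You could simply do:

if '.pqr.gz' not in line:
    print(line)

and be done. (Note that this skips any line that contains '.pqr.gz' anywhere, not only names that end in it.)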

Actually, "simple" string manipulation can do a lot in just a few operations, like:

import sys

for line in lines:
    # each line looks like "f1.gz => succeed"
    filename, result = line.split(' => ')
    if filename.endswith('.pqr.gz'):
        print("Skipping file {}".format(filename), file=sys.stderr)
        continue
    print(filename)
    # and you could do something if result == "succeed" here afterwards!

Since you insist on doing it with a regex, here is a visualization of your current one:

[Image: regular expression visualization]

And here's a solution inspired by @rawing's suggestion:

.*(?<!\.pqr\.gz) =>
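
To try the same idea on the bare file names from the question (without the " => result" part), here is a minimal sketch. The names `LOOKBEHIND`, `LOOKAHEAD` and `is_allowed` are made up for illustration, and anchoring with `\Z` is my own choice rather than something from the question. It shows the negative lookbehind next to an equivalent negative lookahead, since the question was about lookaheads:

```python
import re

# Hypothetical names, for illustration only.
# Negative lookbehind: after consuming the whole name, the 7 characters
# just before the end must not be ".pqr.gz".
LOOKBEHIND = re.compile(r'.*(?<!\.pqr\.gz)\Z')

# Negative lookahead: at the start, refuse any name that ends in ".pqr.gz".
LOOKAHEAD = re.compile(r'(?!.*\.pqr\.gz\Z).*\Z')


def is_allowed(name, pattern=LOOKBEHIND):
    """Return True unless name ends in '.pqr.gz'."""
    return pattern.match(name) is not None


for name in ['f1.gz', 'f1.abc.pqr', 'f1.pqr.gz', 'f1.abc.gz']:
    print(name, '=>', 'succeed' if is_allowed(name) else 'fail')
```

Both patterns should agree on the four examples from the question; the lookbehind works in Python's re module here because `\.pqr\.gz` has a fixed width.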

[Image: regular expression visualization]

zmo
  • 24,463
  • 4
  • 54
  • 90

One thing to be aware of with Python's re module is that re.match implicitly anchors to the start of the string.

Also, you can match literal periods by escaping them (\.), which is probably easier to read (and potentially faster) than putting them in a character class.

With re.match, the following regex matches the names you want to exclude, so keep a file only when it does not match:

r'.*\.pqr\.gz$'

If using re.search instead, the regex can be shortened to just this:

r'\.pqr\.gz$'
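
As a rough sketch of how these two patterns would be used (the names `WITH_MATCH`, `WITH_SEARCH` and `names` are only for illustration), note that both patterns match the files to exclude, so the filter keeps a name when there is no match:

```python
import re

# Both patterns match names that end in ".pqr.gz".
WITH_MATCH = re.compile(r'.*\.pqr\.gz$')   # re.match anchors at the start
WITH_SEARCH = re.compile(r'\.pqr\.gz$')    # re.search scans the whole string

names = ['f1.gz', 'f1.abc.pqr', 'f1.pqr.gz', 'f1.abc.gz']

# Keep a name only if the "ends in .pqr.gz" pattern does NOT match.
kept_match = [n for n in names if not WITH_MATCH.match(n)]
kept_search = [n for n in names if not WITH_SEARCH.search(n)]

print(kept_match)
print(kept_search)
```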
KingRadical
  • Another thing to be aware of with re.match is that you have to provide your own anchor to the END of the string. There are 2 choices, `\Z` and `$` ... `$` is there as a hangover from perl. Use `\Z` – John Machin Jan 29 '17 at 22:36
  • Again, `\Z` is only preferable if you explicitly want trailing newlines to be taken into account by the match statement. `$` is not just a hangover from perl, it is a different anchor that is also useful. For example, if you are trying to match to the end of a line rather than the end of a string, especially when using `flags=re.MULTILINE`, `\Z` is the wrong choice. – KingRadical Jan 30 '17 at 20:19
  • Again look at the OP's question ... wants strings that end in "foo", not "foo\n" – John Machin Jan 30 '17 at 21:08
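
For anyone following the `$` versus `\Z` discussion in the comments above, here is a small sketch of the difference in Python's re module. The variable names and sample strings are made up for illustration:

```python
import re

with_newline = 'f1.pqr.gz\n'

# "$" also matches just before a single trailing newline ...
dollar_hit = re.search(r'\.pqr\.gz$', with_newline) is not None
# ... while "\Z" matches only at the very end of the string.
big_z_hit = re.search(r'\.pqr\.gz\Z', with_newline) is not None

print(dollar_hit, big_z_hit)

# With re.MULTILINE, "$" matches at the end of every line,
# whereas "\Z" still matches only at the end of the whole string.
text = 'a.pqr.gz\nb.gz\nc.pqr.gz'
per_line = re.findall(r'\.pqr\.gz$', text, flags=re.MULTILINE)
whole_string = re.findall(r'\.pqr\.gz\Z', text, flags=re.MULTILINE)

print(len(per_line), len(whole_string))
```

So `\Z` is the stricter choice when a name with a trailing newline must not count, and `$` with re.MULTILINE is the one to use for per-line matching.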