How do I write a regex that excludes certain file suffixes?

Question

I'm looking at the tutorial given here :-

https://docs.python.org/2/howto/regex.html#lookahead-assertions

I want to exclude files that end in .pqr.gz and I'm not quite sure how to do that.

e.g., the expected behaviour is :-

f1.gz => succeed
f1.abc.pqr => succeed
f1.pqr.gz => fail
f1.abc.gz => succeed

The best regex I could come up with was :-

r'.*[.](?=[^.]*[.][^.]*)(?!pqr[.]gz$)[^.]*[.][^.]*$'

This excludes files that end in .pqr.gz but doesn't for example allow files that are just f1.gz (i.e. first case I wrote above).
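A quick way to reproduce this, as a minimal sketch (the `names` list and the `results` dictionary are illustrative names for the four cases above, not part of any library):

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'.*[.](?=[^.]*[.][^.]*)(?!pqr[.]gz$)[^.]*[.][^.]*$')

# The four sample file names listed above.
names = ['f1.gz', 'f1.abc.pqr', 'f1.pqr.gz', 'f1.abc.gz']

# True means the pattern accepts the name.
results = {name: bool(pattern.match(name)) for name in names}

for name in names:
    print(name, '=>', 'succeed' if results[name] else 'fail')
```

The pattern needs at least two dots in the name, which is why f1.gz is rejected even though it should succeed.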

Any ideas on how this can be improved?

EDIT :- There are better ways to do this (e.g., using string.endswith), but I'm curious about how to do this with a regex purely as an exercise.

@Rawing That works. Can you write that up as an answer (hopefully with an explanation) and I'll accept it. — owagh, Jan 19 '17 at 20:59

zmo · Answer 1 · 2017-01-19T21:00:19.127

0

Well, TBH, your use of regex seems overkill to me. You could simply do:

if '.pqr.gz' not in line:
    print(line)

and done.

Actually, "simple" string manipulation can do a lot in just a few simple operations, like:

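from __future__ import print_function  # only needed on Python 2
import sys

for line in lines:
    # each line looks like "f1.gz => succeed"
    name, result = line.split(' => ')
    if name.endswith('.pqr.gz'):
        print("Skipping file {}".format(name), file=sys.stderr)
        continue
    print(name)
    # and you could do something if result == "succeed" afterwards!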

Since you insist on doing it with regexps:

Here's a representation of your current regex (image no longer available).

And here's a solution inspired by @Rawing's suggestion:

.*(?<!\.pqr\.gz) =>
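To try the lookbehind on the sample lines from the question, here is a minimal sketch. The second pattern, anchored with `$` for bare file names instead of ` =>`, is an adaptation and not part of the original suggestion; all variable names are illustrative.

```python
import re

# Pattern from the answer: matches lines such as "f1.gz => succeed".
line_pattern = re.compile(r'.*(?<!\.pqr\.gz) =>')

# Assumed variant for bare file names: anchor at the end of the string.
name_pattern = re.compile(r'.*(?<!\.pqr\.gz)$')

lines = [
    'f1.gz => succeed',
    'f1.abc.pqr => succeed',
    'f1.pqr.gz => fail',
    'f1.abc.gz => succeed',
]

# Keep the lines whose file name does not end in .pqr.gz.
kept_lines = [line for line in lines if line_pattern.match(line)]

# Same check on the bare file names.
kept_names = [
    name for name in (line.split(' => ')[0] for line in lines)
    if name_pattern.match(name)
]

print(kept_lines)
print(kept_names)
```

The negative lookbehind is evaluated at the position just before ` =>` (or at the end of the string in the variant) and rejects the match only when the seven characters before that position are `.pqr.gz`. Python requires lookbehind patterns to have a fixed width, which this one does.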

edited Jan 19 '17 at 21:00

answered Jan 19 '17 at 20:50

zmo

I think I should mention that it's more of a mental exercise in using regexes than for any practical purpose. – owagh Jan 19 '17 at 20:54
But you want to filter out extensions not being gz or pqr, right? – Jean-François Fabre Jan 19 '17 at 20:58
(Too bad the image links are dead… what's happening with debuggex? ☹) – zmo Jan 19 '17 at 21:04

score -1 · Answer 2 · answered Jan 19 '17 at 21:44

One thing to be aware of with Python's re module is that re.match implicitly anchors to the start of the string.

Also, you can match literal periods by escaping them (\.), which is probably easier to read (and potentially faster) than putting them in a character class.

With re.match, the following regex matches the names you want to exclude, so keep a file only if it does not match:

r'.*\.pqr\.gz$'

If using re.search instead, the regex can be shortened to just this:

r'\.pqr\.gz$'
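A minimal sketch of filtering with these two patterns, negating the result because they match the unwanted suffix. The `names` list is an assumed sample taken from the question, and the other variable names are illustrative.

```python
import re

names = ['f1.gz', 'f1.abc.pqr', 'f1.pqr.gz', 'f1.abc.gz']

# re.match only tries at the start of the string, so the pattern
# needs a leading .* to reach the suffix.
kept_with_match = [n for n in names if not re.match(r'.*\.pqr\.gz$', n)]

# re.search scans the whole string, so the suffix alone is enough.
kept_with_search = [n for n in names if not re.search(r'\.pqr\.gz$', n)]

print(kept_with_match)
print(kept_with_search)
```

Both lists should contain the same three names, with f1.pqr.gz left out.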

KingRadical

Another thing to be aware of with re.match is that you have to provide your own anchor to the END of the string. There are 2 choices, `\Z` and `$` ... `$` is there as a hangover from perl. Use `\Z` – John Machin Jan 29 '17 at 22:36
Again, `\Z` is only preferable if you explicitly want trailing newlines to be taken into account by the match statement. `$` is not just a hangover from perl, it is a different anchor that is also useful. For example, if you are trying to match to the end of a line rather than the end of a string, especially when using `flags=re.MULTILINE`, `\Z` is the wrong choice. – KingRadical Jan 30 '17 at 20:19
Again look at the OP's question ... wants strings that end in "foo", not "foo\n" – John Machin Jan 30 '17 at 21:08
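The difference between the two anchors discussed in these comments can be checked directly. A minimal sketch, where the sample strings and variable names are assumptions for illustration:

```python
import re

plain = 'f1.pqr.gz'
with_newline = 'f1.pqr.gz\n'

# $ matches at the end of the string and also just before a single
# trailing newline.
dollar_plain = bool(re.search(r'\.pqr\.gz$', plain))
dollar_newline = bool(re.search(r'\.pqr\.gz$', with_newline))

# \Z matches only at the very end of the string.
z_plain = bool(re.search(r'\.pqr\.gz\Z', plain))
z_newline = bool(re.search(r'\.pqr\.gz\Z', with_newline))

print(dollar_plain, dollar_newline, z_plain, z_newline)
```

So the two anchors only disagree when the string ends in a newline, which matters if file names are read from a file without stripping the line endings.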
