You can simplify this by using the appropriate functions from os.path.
First, if you call normpath, you no longer have to worry about both kinds of path separators, just os.sep (note that this is a bad thing if you're trying to, e.g., process Windows paths on Linux… but if you're trying to process native paths on any given platform, it's exactly what you want). It also removes any trailing slashes.
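For instance (these example paths are purely illustrative), a quick sketch of what normpath does on each platform:

import os.path

os.path.normpath('C:/spam/eggs/')   # -> 'C:\\spam\\eggs' when run on Windows
os.path.normpath('/spam/eggs/')     # -> '/spam/eggs' when run on Linux/macOS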
Next, if you call basename instead of split, you no longer have to throw in those trailing [1]s.
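In other words (with a made-up path just to show the difference):

import os.path

os.path.split('/spam/eggs.txt')      # -> ('/spam', 'eggs.txt'), so you'd need [1]
os.path.basename('/spam/eggs.txt')   # -> 'eggs.txt'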
Unfortunately, there's no equivalent of basename vs. split for splitext… but you can write one easily, which will make your code more readable in exactly the same way as using basename.
As for the rest of it… a regexp is the obvious way to strip out any digits (although you really don't need the + there). And, since you've already got a regexp, it might be simpler to toss the _ in there as well instead of handling it separately.
So:
import os
import re

def stripext(p):
    return os.path.splitext(p)[0]

for f in files:
    a = stripext(stripext(os.path.basename(os.path.normpath(f))))
    b = re.sub(r'[\d_]', '', a)
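For example, with a hypothetical entry like '/data/sample_01.tar.gz' (a made-up path, since I don't know what your actual files look like), the steps come out as:

f = '/data/sample_01.tar.gz'                 # hypothetical input
os.path.basename(os.path.normpath(f))        # -> 'sample_01.tar.gz'
stripext(stripext('sample_01.tar.gz'))       # -> 'sample_01'
re.sub(r'[\d_]', '', 'sample_01')            # -> 'sample'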
Of course the whole thing is probably more readable if you wrap it up as a function:
def process_path(p):
    a = stripext(stripext(os.path.basename(os.path.normpath(p))))
    return re.sub(r'[\d_]', '', a)

for f in files:
    b = process_path(f)
Especially since you can now turn your loop into a list comprehension or generator expression or map call:
processed_files = map(process_path, files)
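The comprehension and generator spellings are just as short (and note that in Python 3, map returns a lazy iterator, so wrap it in list() if you actually need a list):

processed_files = [process_path(f) for f in files]   # list comprehension
processed_files = (process_path(f) for f in files)   # lazy generator expression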
> I'm simply curious about the speed, as I'm under the impression compiled regex functions are very fast.
Well, yes, in general. However, uncompiled string patterns are also very fast.
When you use a string pattern instead of a compiled regexp object, what happens is this:
- The re module looks up the pattern in a cache of compiled regular expressions.
- If not found, the string is compiled and the result added to the cache.
So, assuming you don't use many dozens of regular expressions in your app, either way your pattern gets compiled exactly once and run as a compiled expression repeatedly. The only additional cost of using uncompiled expressions is looking the pattern up in that cache dictionary, which is incredibly cheap. It's especially cheap when the pattern is a string literal: it's guaranteed to be the exact same string object every time, so its hash is cached as well, and after the first time the dict lookup turns into just a mod and an array lookup.
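As a rough sketch of the two styles (the behavior is the same either way, since the string pattern hits the cache after the first call):

import re

# Uncompiled: the pattern string is compiled on the first call,
# then found in re's internal cache on every later call.
clean = re.sub(r'[\d_]', '', 'sample_01')

# Pre-compiled: you hold on to the compiled pattern object yourself.
pattern = re.compile(r'[\d_]')
clean = pattern.sub('', 'sample_01')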
For most apps, you can just assume the re cache is good enough, so the main reason for deciding whether or not to pre-compile regular expressions is readability. For example, when you've got a function that runs a slew of complicated regular expressions whose purposes are hard to understand, it can definitely help to give each one of them a name, so you can write for r in (r_phone_numbers, r_addresses, r_names): …, in which case it would be almost silly not to compile them.
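A minimal sketch of that style (the names and regexes here are placeholders, not anything from your actual code):

import re

r_phone_numbers = re.compile(r'\d{3}-\d{4}')
r_addresses = re.compile(r'\d+ \w+ (?:St|Ave|Rd)')
r_names = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+')

text = 'Jane Doe, 123 Main St, 555-1234'   # made-up sample input
for r in (r_phone_numbers, r_addresses, r_names):
    print(r.findall(text))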