
I would like to get the alphabetic parts of the filename from some file paths.

files = ['data/Conversion/201406/MM_CLD_Conversion_Advertiser_96337_Daily_140606.zip', 
         'data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip', 
         'data/AQlog/201406/orx140605.csv.zip',
         'data/AQlog/201406/orx140605.csv.zip/']

Currently I do this:

  1. strip end slashes
  2. os.path.split()[1] to get filename
  3. two os.path.splitext() to remove a possible 2 file extensions
  4. lose the numbers
  5. lose the underscores

Code:

import os
import re

for f in files:
    a = os.path.splitext(os.path.splitext(os.path.split(f.rstrip('/\\'))[1])[0])[0]
    b = re.sub(r'\d+', '', a).replace('_', '')

Result:

'MMCLDConversionAdvertiserDaily'
'MMCLDMatchAdvertiserDaily'
'orx'
'orx'

Is there a faster or more Pythonic way, perhaps using a compiled regex? Or is using the os.path module reasonable here? I don't have to do this more than 100 times, so it's not a speed problem; this is just for clarity.

ehacinom
  • Why do you ask "Is there a faster… way" when in the very next sentence you say "I also do not have to do this more than 100 times, so it's not a speed problem"? If speed isn't an issue, why ask for speed? – abarnert Sep 02 '14 at 20:19
  • Also, do you know how to find the docs? If so, why use, e.g., [`os.path.split()[1]`](https://docs.python.org/3/library/os.path.html#os.path.split) instead of [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename)? – abarnert Sep 02 '14 at 20:20
  • hi @abarnert! I'm simply curious about the speed, as I'm under the impression compiled regex functions are very fast. And yep, missed `os.path.basename()`, as I've played around with os.path before and didn't look carefully at the docs this time around. – ehacinom Sep 02 '14 at 20:24
  • I'll add something to my answer about compiled regexps… – abarnert Sep 02 '14 at 20:30

2 Answers


Without using regular expressions:

import os
import string
trans = string.maketrans('_', ' ')
def get_filename(path):
    # If you need to keep the directory, use os.path.split
    filename = os.path.basename(path.rstrip('/'))
    try:
        # If the extension starts at the last period, use os.path.splitext;
        # if it starts at the second-to-last period, call os.path.splitext
        # twice, and so on. Since it sounds like you don't know how many
        # extensions a filename may have, it's safer to assume the extension
        # starts at the first period, which is what filename.split('.', 1)
        # does.
        filename_without_ext, extension = filename.split('.', 1)
    except ValueError:
        filename_without_ext = filename
        extension = ''
    filename_cleaned = filename_without_ext.translate(trans, string.digits)
    return filename_cleaned

>>> path = 'data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip/'
>>> get_filename(path)
'MM CLD Match Advertiser  Daily '

Use whichever approach is more readable. I usually avoid regular expressions when the problem doesn't require them; in this case, regular string operations can do everything you want to do.

If you want to remove the extra spaces (as indicated in your Result), use filename_cleaned.replace(' ', ''). If you are likely to have other kinds of whitespace, it can be removed with ''.join(filename_cleaned.split()).
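To make the cleanup step concrete, here is a quick run of both variants against the sample output above:

```python
cleaned = 'MM CLD Match Advertiser  Daily '

# Remove plain spaces only:
print(cleaned.replace(' ', ''))   # MMCLDMatchAdvertiserDaily

# Remove any run of whitespace (tabs, newlines, repeated spaces):
print(''.join(cleaned.split()))   # MMCLDMatchAdvertiserDaily
```

Both give the same answer here; the split/join form is the safer one if the input might contain tabs or newlines.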

Note: If you are using Python 3, replace trans=string.maketrans('_', ' ') with trans=str.maketrans('_', ' ', string.digits), and filename_without_ext.translate(trans, string.digits) with filename_without_ext.translate(trans). This change was made as part of improving unicode language support. See more: How come string.maketrans does not work in Python 3.1?

Here's the Python 3 code:

import os
import string
trans = str.maketrans('_', ' ', string.digits)
def get_filename(path):
    filename = os.path.basename(path.rstrip('/'))
    filename_without_ext = filename.split('.', 1)[0]
    filename_cleaned = filename_without_ext.translate(trans)
    return filename_cleaned
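To make the Python 3 change concrete, here is the translation table on its own, applied to one of the sample filenames (after the basename and extension handling have already run):

```python
import string

# In Python 3, str.maketrans takes an optional third argument listing
# characters to delete; str.translate then does both jobs in one pass.
trans = str.maketrans('_', ' ', string.digits)

print('MM_CLD_Match_Advertiser_111423_Daily_140608'.translate(trans))
# MM CLD Match Advertiser  Daily 
```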
IceArdor
  • You're not stripping out the numbers, which the OP asked for, and is doing—and which is a perfect example of the kind of thing regexps are good for and regular string operations are not… – abarnert Sep 02 '14 at 20:22
  • Fixed using string translation's deletechars argument. – IceArdor Sep 02 '14 at 20:31
  • OK, it now works—although it's also now more than 80 characters wide, and doesn't work in Python 3.x anymore. I'm not sure it's worth going to such extents to avoid regular expressions (any more than it's worth going to ridiculous extents to use them even when they're unnecessary, as so many people do…), but it's at least worth showing how to do so, so definitely +1. (But it would be better to be PEP8 compliant and not have a horizontal scrollbar, and to explain the 2.x-3.x difference you've added.) – abarnert Sep 02 '14 at 20:37
  • @abarnert Thanks for the help cleaning up this answer. – IceArdor Sep 02 '14 at 20:52
  • Your last paragraph isn't what you want for Python 3. Your filenames are `str`, not `bytes`, so you can't use a `bytes.maketrans`. What you want is `str.maketrans`. But, more importantly, `str.translate` no longer takes a second argument; you want to pass the characters to be deleted as a third argument to `str.maketrans` instead. (Also, in both 2.x and 3.x, you might as well use `string.digits`.) – abarnert Sep 02 '14 at 20:57
  • @abarnert Hmm.. There are more differences in Python3 than I thought. I'm still using Python2 until Jython support catches up. – IceArdor Sep 02 '14 at 21:12

You can simplify this by using the appropriate functions from os.path.

First, if you call normpath you no longer have to worry about both kinds of path separators, just os.sep (note that this is a bad thing if you're trying to, e.g., process Windows paths on Linux… but if you're trying to process native paths on any given platform, it's exactly what you want). It also removes any trailing slashes.

Next, if you call basename instead of split, you no longer have to throw in those trailing [1]s.

Unfortunately, there's no equivalent of basename vs. split for splitext… but you can write one easily, which will make your code more readable in the exact same way as using basename.

As for the rest of it… a regexp is the obvious way to strip out the digits (although you really don't need the + there). And, since you've already got a regexp, it might be simpler to toss the _ into it instead of handling it separately.

So:

def stripext(p):
    return os.path.splitext(p)[0]

for f in files:
    a = stripext(stripext(os.path.basename(os.path.normpath(f))))
    b = re.sub(r'[\d_]', '', a)

Of course the whole thing is probably more readable if you wrap it up as a function:

def process_path(p):
    a = stripext(stripext(os.path.basename(os.path.normpath(p))))
    return re.sub(r'[\d_]', '', a)

for f in files:
    b = process_path(f)

Especially since you can now turn your loop into a list comprehension, a generator expression, or a map call:

processed_files = map(process_path, files)

I'm simply curious about the speed, as I'm under the impression compiled regex functions are very fast.

Well, yes, in general. However, uncompiled string patterns are also very fast.

When you use a string pattern instead of a compiled regexp object, what happens is this:

  • The re module looks up the pattern in a cache of compiled regular expressions.
  • If not found, the string is compiled and the result added to the cache.

So, assuming you don't use many dozens of regular expressions in your app, either way, your pattern gets compiled exactly once, and run as a compiled expression repeatedly. The only additional cost to using the uncompiled expressions is looking it up in that cache dictionary, which is incredibly cheap—especially when it's a string literal, so it's guaranteed to be the exact same string object every time, so its hash will be cached as well, so after the first time the dict lookup turns into just a mod and an array lookup.
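A minimal sketch of the two styles side by side; both compile the pattern exactly once in practice, and they produce identical results:

```python
import re

sample = 'MM_CLD_Conversion_140606'

# Uncompiled: re.sub looks the string pattern up in re's internal
# cache of compiled patterns, compiling it only on the first call.
result_a = re.sub(r'[\d_]', '', sample)

# Precompiled: the same pattern object, just given a name up front.
pattern = re.compile(r'[\d_]')
result_b = pattern.sub('', sample)

print(result_a, result_a == result_b)  # MMCLDConversion True
```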

For most apps, you can just assume the re cache is good enough, so the main reason for deciding whether to pre-compile regular expressions or not is readability. For example, when you've got, e.g., a function that runs a slew of complicated regular expressions whose purpose is hard to understand, it can definitely help to give each one of them a name, so you can write for r in (r_phone_numbers, r_addresses, r_names): …, in which case it would be almost silly not to compile them.
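For instance, a readability-motivated use of compiled regexps might look like the sketch below. The pattern names and the regexes themselves are made up for illustration; the point is that naming the compiled objects documents what each one matches:

```python
import re

# Hypothetical patterns; the names, not the regexes, are the point.
r_phone_numbers = re.compile(r'\d{3}-\d{4}')
r_addresses = re.compile(r'\d+ \w+ (St|Ave|Rd)\b')
r_names = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+')

def redact(text):
    # Iterating over named pattern objects reads far better than
    # repeating three opaque pattern strings inline.
    for r in (r_phone_numbers, r_addresses, r_names):
        text = r.sub('[REDACTED]', text)
    return text
```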

abarnert