Judging by your inputs, your desired regular expression needs to meet the following criteria:
- Match the number provided, exactly
- Ignore number matches in the file extension, if present
- Handle file names that include spaces
I think this will meet all these requirements:
import re

def generate(n):
    # Name part: no dots and no other digits around n; an optional extension may follow.
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]
Usage:
>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']
Note that this solution compiles the pattern once into a Pattern object and reuses that object to check each file. This avoids passing the pattern string to re.fullmatch for every filename, which would have to compile the pattern (or look it up in the re module's internal cache) on each call.
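For comparison, here is a rough sketch of the uncompiled form that the note above refers to. The name check_files_uncompiled is made up for illustration; it should return the same list as check_files, just with the pattern string handed to re.fullmatch on every iteration.

```python
import re

def check_files_uncompiled(n, files):
    # Hypothetical variant: the pattern string is passed to re.fullmatch on each call.
    pattern = r'[^.\d]*' + str(n) + r'[^.\d]*(\..*)?'
    return [f for f in files if re.fullmatch(pattern, f)]

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
print(check_files_uncompiled(4, files))
```

Since re keeps a cache of recently compiled patterns, the difference is likely to be small for short file lists, but compiling once keeps the intent explicit.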
This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. If you allow file names that contain a ., the pattern can no longer tell a dot inside the name from the dot that starts the extension, so you won't be able to exclude extensions from the search. For example, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:
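def generate(n):
    # Dots are now allowed in the name; only other digits are excluded.
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')

def strip_extension(f):
    # str.removesuffix requires Python 3.9+; this only strips '.mp4'.
    return f.removesuffix('.mp4')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]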
Note that this pattern now allows a . in the name and no longer excludes an extension by itself. Instead, it relies on preprocessing (the strip_extension function) to remove the file extension from the filename before matching. As written, strip_extension only removes .mp4, so extend it to cover whichever extensions you expect.
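If you have more than one extension to deal with, one possible sketch is to strip only suffixes from a known list, so that a name like ep.4 is left alone. KNOWN_EXTENSIONS and its contents are assumptions for illustration, not something taken from your inputs; the first print also shows the false positive you get when nothing is stripped.

```python
import os
import re

# Hypothetical set of suffixes to treat as real file extensions.
KNOWN_EXTENSIONS = {'.mp4', '.mkv', '.avi'}

def generate(n):
    # Dots are allowed in the name; only other digits are excluded.
    return re.compile(r'[^\d]*' + re.escape(str(n)) + r'[^\d]*')

def strip_extension(f):
    # Drop the last dotted suffix only if it is a known extension.
    root, ext = os.path.splitext(f)
    return root if ext.lower() in KNOWN_EXTENSIONS else f

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]

files = ['ep.4', 'file.mp4', 'file 4.mkv', 'ep-40.avi']

# Without stripping, the digit in '.mp4' produces a false positive.
print([f for f in files if generate(4).fullmatch(f)])
# With stripping, only the names that really contain the number remain.
print(check_files(4, files))
```

The trade-off is that any extension missing from the set is treated as part of the name, so a file such as clip.mp4 with an unlisted extension containing the digit would still slip through.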
As an addendum, you'll occasionally get files that have the number prefixed with zeroes (e.g. 004, 0001). You can modify the regular expression to handle this case as well:
def generate(n):
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')
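A quick way to sanity-check the zero-padded version is something like the following sketch. The helper check_names and the sample names are made up for illustration, and the names are assumed to have had their extensions stripped already.

```python
import re

def generate(n):
    # 0* lets any number of leading zeroes precede the number.
    return re.compile(r'[^\d]*0*' + re.escape(str(n)) + r'[^\d]*')

def check_names(n, names):
    # Hypothetical helper: names are assumed to be extension-free.
    regex = generate(n)
    return [name for name in names if regex.fullmatch(name)]

names = ['ep 4', 'ep 004', 'ep 0004', 'ep 40', 'ep 104', 'ep 14']
print(check_names(4, names))
```

Padded forms such as 004 are accepted, while 40, 104 and 14 are still rejected because the only digits allowed before the number are zeroes.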