I have a folder where each file is named after a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact number, so if I enter '4' as input, it should only return files containing '4' and not files containing '14' or '40', for example. My problem is that the program returns every file whose name contains the string anywhere. Note that the numbers aren't always in the same spot: for some files the number is at the end, for others it's in the middle.

For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'], and I want only files with the exact number 4 in them, then I would only want to return ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep']

Here is what I have (in this case I only want to return files of type mp4):

for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)

I also tried `==` instead of `in`.

Wiktor Stribiżew
bull11trc
  • `if '.mp4' and ('4') in f` That is not the way to check for multiple conditions. Use this instead: `if 'mp4' in f and '4' in f`. However, in this case, "4" is already in "mp4", so that specific condition is useless. – John Gordon Dec 09 '22 at 01:19
  • "4" is just an example, I also want only files with 5, 6, etc – bull11trc Dec 09 '22 at 03:35
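
To make the point about the condition concrete, here is a small sketch of how Python evaluates the original test. The file name `ep-40.avi` is only an illustrative assumption, chosen because it is not an mp4 file and does not contain a standalone 4:

```python
f = 'ep-40.avi'  # hypothetical file name, used only for illustration

# Parsed as: '.mp4' and (('4') in f). The non-empty string '.mp4' is always
# truthy, so the whole test reduces to: '4' in f
original_test = bool('.mp4' and ('4') in f)

# Two separate checks, one for the extension and one for the substring
separate_checks = f.endswith('.mp4') and '4' in f

print(original_test)    # True
print(separate_checks)  # False
```

Even with separate checks, `'4' in f` is still a plain substring test, so it cannot tell 4 apart from 40 or 14.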

3 Answers


Judging by your inputs, your desired regular expression needs to meet the following criteria:

  1. Match the number provided, exactly
  2. Ignore number matches in the file extension, if present
  3. Handle file names that include spaces

I think this will meet all these requirements:

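import re

def generate(n):
    # Non-digit, non-dot text, then n, then more such text, then an optional extension
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')

def check_files(n, files):
    # Compile the pattern once and reuse it for every file name
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]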

Usage:

>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']

Note that this solution creates a Pattern object once and uses that object to check each file. This offers a small performance benefit over calling re.fullmatch with the pattern string and filename directly: although the re module caches recently compiled patterns, reusing the Pattern object skips that cache lookup on every call.
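
As a rough illustration of the two calling styles, the sketch below runs the same pattern both ways on a shortened, assumed file list and checks that the results agree. It does not measure timing:

```python
import re

files = ['ep 4', 'ep-40', 'file 4.mp4']  # shortened sample list
pattern = r'^[^.\d]*4[^.\d]*(\..*)?$'

# Style 1: compile once, then reuse the Pattern object
compiled = re.compile(pattern)
with_compiled = [f for f in files if compiled.fullmatch(f)]

# Style 2: pass the pattern string to re.fullmatch on every call
with_module = [f for f in files if re.fullmatch(pattern, f)]

print(with_compiled == with_module)  # True
print(with_compiled)                 # ['ep 4', 'file 4.mp4']
```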

This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. Because of the greedy nature of regular expressions, if you allow for file names with . then you won't be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:

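import re

def generate(n):
    # Dots are now allowed around the number; only other digits are excluded
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')

def strip_extension(f):
    # str.removesuffix requires Python 3.9 or later
    return f.removesuffix('.mp4')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]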

Note that this solution now includes the . in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function) to remove any file extensions from the filename before matching.
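
If more than one extension needs to be ignored, the preprocessing step could be extended along these lines. This is only a sketch: `KNOWN_EXTENSIONS` is a hypothetical list that you would adjust to your own files, and the slicing is used instead of `removesuffix` so it also runs on Python versions before 3.9:

```python
import re

# Hypothetical list of extensions to ignore; adjust for your own files.
KNOWN_EXTENSIONS = ('.mp4', '.mkv', '.avi')

def strip_extension(f):
    # Remove the first known extension that the name ends with, if any
    for ext in KNOWN_EXTENSIONS:
        if f.endswith(ext):
            return f[:-len(ext)]
    return f

def check_files(n, files):
    regex = re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')
    return [f for f in files if regex.fullmatch(strip_extension(f))]

files = ['ep 4', 'ep-40', 'file.mp4', 'file 4.mkv', 'ep.4.', 'ep. 4. ']
print(check_files(4, files))
# ['ep 4', 'file 4.mkv', 'ep.4.', 'ep. 4. ']
```

An explicit list is used here rather than `os.path.splitext` because a name like `ep.4 ` would otherwise have `.4 ` treated as its extension.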

As an addendum, occasionally you'll get files that have the number prefixed with zeroes (e.g. 004, 0001, etc.). You can modify the regular expression to handle this case as well:

def generate(n):
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')
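
A quick usage sketch for the zero-padded variant, with sample names that are assumptions made up for illustration:

```python
import re

def generate(n):
    # Allow any number of leading zeroes directly before n
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')

regex = generate(4)
names = ['ep 004', 'ep 4', 'ep 040', 'ep 104']  # hypothetical file names
print([name for name in names if regex.fullmatch(name)])
# ['ep 004', 'ep 4']
```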
Woody1193
  • I disagree with your expected output and I think `file.mp4` should also be returned. – Tim Biegeleisen Dec 09 '22 at 02:27
  • @TimBiegeleisen It's not in the list of output as described by the OP. – Woody1193 Dec 09 '22 at 03:04
  • how do I account for ep.2. ? – bull11trc Dec 09 '22 at 06:06
  • @bull11trc You can't with this solution. – Woody1193 Dec 09 '22 at 06:19
  • those examples I initially listed were not inclusive, just meant to give an idea of what I am looking for. I basically want only an exact match (meaning 4 matches only with 4 and not 40 or mp4, and 40 wouldn't match with 402, for instance). But I've edited the "scope" of the files, I'm sure I am missing some. – bull11trc Dec 09 '22 at 06:27
  • @bull11trc The problem with regular expressions is that they require you to have some idea of the structure of the data you're working with. In general, files are named as `name.extension`, so if you want to search for files with a specific number in the name without searching on the extension, then you need to remove that first. I've updated my answer with this in mind. – Woody1193 Dec 09 '22 at 06:29
  • @bull11trc I've updated my answer. I think this will give you what you're looking for. The one thing I will say is that if you have extensions you want to ignore other than `.mp4` then you'll want to add those in as well. – Woody1193 Dec 09 '22 at 06:46

We can use re.search along with a list comprehension for a regex option:

import re

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output)  # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
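
If the extension should not count as part of the name, the same lookarounds could be applied to the name with a required extension removed first. The following is a sketch under that assumption; the helper names `has_exact_number` and `filter_by_number` and the sample file list are hypothetical, and `ext` is assumed to be a non-empty string:

```python
import re

def has_exact_number(stem, num):
    """True if num occurs in stem with no digit directly before or after it."""
    pattern = r'(?<!\d)' + re.escape(str(num)) + r'(?!\d)'
    return re.search(pattern, stem) is not None

def filter_by_number(files, num, ext='.mp4'):
    """Keep files ending in ext whose name, without ext, contains num exactly."""
    return [f for f in files
            if f.endswith(ext) and has_exact_number(f[:-len(ext)], num)]

files = ['ep 4.mp4', 'ep-40.mp4', 'file.mp4', '404ep.mp4', 'ep.4.mp4', 'ep 4.avi']
print(filter_by_number(files, 4))  # ['ep 4.mp4', 'ep.4.mp4']
```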
Tim Biegeleisen

This can be accomplished with the following function:

import os


files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]


def number_filter(files, number):
    filtered_files = []
    for file_name in files:

        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue

        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)

        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue

        # if the number is present in the file name, we must determine if it's part of a different number
        num_index = file_name.index(str(number))

        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if a next character exists and is a digit
            next_index = num_index + len(str(number))
            if next_index < len(file_name) and file_name[next_index].isdigit():
                continue

        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue

        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[num_index + len(str(number))].isdigit()
            ):
                continue

        print(file_name)
        filtered_files.append(file_name)

    return filtered_files


output = number_filter(files, 4)

for file in output:
    assert file in desired_output

for file in desired_output:
    assert file in output
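
One limitation worth noting: `file_name.index` only looks at the first occurrence of the number, so a name such as `14 ep 4` would be skipped even though it contains a standalone 4. A possible variant that checks every occurrence is sketched below. The helper name `contains_exact_number` is hypothetical, and this version leaves out the extension handling shown above:

```python
def contains_exact_number(name, number):
    """True if number appears in name with no digit directly before or after it."""
    target = str(number)
    start = name.find(target)
    while start != -1:
        end = start + len(target)
        # A boundary is fine if we are at the edge of the string or next to a non-digit
        before_ok = start == 0 or not name[start - 1].isdigit()
        after_ok = end == len(name) or not name[end].isdigit()
        if before_ok and after_ok:
            return True
        # Otherwise look for the next occurrence
        start = name.find(target, start + 1)
    return False

print(contains_exact_number('14 ep 4', 4))  # True
print(contains_exact_number('ep-40', 4))    # False
```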

CpE_Sklarr