Python: Identifying numerically named folders in a folder structure

Question

I have the function below, which walks the root of a given directory, grabs all subdirectories, and places them into a list. This part works, sort of.

The objective is to determine the highest (largest number) numerically named folder. As long as the folder contains only numerically named folders, and no alphanumeric folders or files, I'm good. However, if a file or folder is present that is not numerically named, I run into issues, because the script seems to collect all subdirectories and files and load everything into the list.

I need to just find those folders whose naming is numeric, and ignore anything else.

Example folder structure for c:\Test
\20200202\
\20200109\
\20190308\
\Apples\
\Oranges\
New Document.txt

This works to walk the directory but puts everything in the list, not just the numeric subfolders.

#Example code
import os 
from pprint import pprint 

files=[]
MAX_DEPTH = 1
folders = ['C:\\Test']
for stuff in folders:
    for root, dirs, files in os.walk(stuff, topdown=True):
        for subdirname in dirs:
            files.append(os.path.join(subdirname))
            #files.append(os.path.join(root, subdirname)) will give full directory
        #print("there are", len(files), "files in", root) will show counts of files per directory
        if root.count(os.sep) - stuff.count(os.sep) == MAX_DEPTH - 1:
            del dirs[:]
pprint(max(files))

Current Result of max(files): New Document.txt

Desired Output: 20200202

What I have tried so far:

I've tried catching each element before I add it to the list, seeing if the string of the subdirname can be converted to int, and then adding it to the list. This fails to convert the numeric subdirnames to an int, and somehow (I don't know how) the New Document.txt file gets added to the list.

files=[]
    MAX_DEPTH = 1
    folders = ['C:\\Test']
    for stuff in folders:
        for root, dirs, files in os.walk(stuff, topdown=True):
            for subdirname in dirs:
                try:
                    subdirname = int(subdirname)
                    print("Found subdir named " + subdirname + " type: " + type(subdirname))
                    files.append(os.path.join(subdirname))
                except:
                    print("Error converting " + str(subdirname) + " to integer")
                    pass
                #files.append(os.path.join(root, subdirname)) will give full directory
            #print("there are", len(files), "files in", root) will show counts of files per directory
            if root.count(os.sep) - stuff.count(os.sep) == MAX_DEPTH - 1:
                del dirs[:]
    return (input + "/" + max(files))

I've also tried appending everything to the list (i.e., without the try/except) and then building a second list using the line below, but I wind up with an empty list. I'm not sure why, and I'm not sure where or how to start looking. Using 'type' on the list items before applying the following shows that everything in the list is a str.

list2 = [x for x in files if isinstance(x,int) and not isinstance(x,bool)]

anakaine · Accepted Answer · 2020-02-02T04:40:56.980

I'm going to go ahead and answer my own question here:

Changing the method entirely helped, and made it significantly faster, and simpler.

import os

# find_newest_date looks for the folder with the largest number and assumes that is the newest data
def find_newest_date(input):
    intlistfolders = []
    # f.name yields just the folder names (f.path would give full paths)
    list_subfolder_names = [f.name for f in os.scandir(input) if f.is_dir()]
    for x in list_subfolder_names:
        try:
            intval = int(x)
            intlistfolders.append(intval)
        except ValueError:
            # not a numerically named folder, so skip it
            pass
    return (input + "/" + str(max(intlistfolders)))

Explanation:

scandir is about 3x faster than walk (reference: directory performance).
scandir also allows the use of f.name to pull out just the folder names, or f.path to get the full paths.

So, use scandir to load up the list with all the subdirs.

Iterate over the list, and try to convert each value to an integer. I don't know why it wouldn't work in the earlier example, but it works in this case.
The first part of the try statement converts to an integer.
If conversion fails, the except clause is run, and 'pass' is essentially a null statement. It does nothing.
Then, finally, join the input directory with the string representation of the maximum numeric value (i.e., the most recently dated folder in this case).
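On why the earlier attempts in the question misbehaved, the following sketch reproduces what appears to be going on. The loop header `for root, dirs, files in os.walk(...)` rebinds the name `files`, so the appends land in os.walk's own list of file names rather than in the empty list created beforehand. Separately, concatenating a str with an int raises TypeError, which a bare `except:` hides, and every name os.walk returns is a str, so an `isinstance(x, int)` filter keeps nothing. The temporary directory here is only a stand-in for c:\Test.

```python
import os
import tempfile

# Build a throwaway directory that mirrors the example layout in the question.
with tempfile.TemporaryDirectory() as base:
    for name in ("20200202", "20200109", "20190308", "Apples", "Oranges"):
        os.mkdir(os.path.join(base, name))
    open(os.path.join(base, "New Document.txt"), "w").close()

    files = []  # the list the question intended to fill
    for root, dirs, files in os.walk(base):  # rebinds `files` to os.walk's file list
        for subdirname in dirs:
            files.append(subdirname)
        del dirs[:]  # do not descend below the top level

    # `files` now holds the file name plus every subdirectory name, all as str.
    shadowed_result = sorted(files)
    print(shadowed_result)

    # The isinstance filter from the question keeps nothing, because no item is an int.
    ints_only = [x for x in files if isinstance(x, int) and not isinstance(x, bool)]

# str + int raises TypeError; a bare `except:` would swallow it silently.
try:
    "Found subdir named " + 20200202
    concat_failed = False
except TypeError:
    concat_failed = True
```

Using a different name for the result list (or for the third loop variable) avoids the collision.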

The function is called with:

folder_named_path = find_newest_date("C:\\Test") or something similar.
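A slightly tighter variant of the same idea, offered as a sketch rather than a replacement: it filters with str.isdecimal() instead of try/except, returns None when no numeric folder exists instead of letting max() raise on an empty list, and keeps the original folder name so leading zeros are not lost. The function name find_newest_numeric_folder and the temporary directory are made up for illustration.

```python
import os
import tempfile


def find_newest_numeric_folder(path):
    """Return the path of the highest-numbered subfolder of `path`, or None."""
    with os.scandir(path) as entries:
        # Keep only directories whose entire name consists of decimal digits.
        numeric_names = [e.name for e in entries if e.is_dir() and e.name.isdecimal()]
    if not numeric_names:
        return None
    # Compare by integer value but return the original name unchanged.
    return os.path.join(path, max(numeric_names, key=int))


# Usage against a temporary copy of the example layout from the question.
with tempfile.TemporaryDirectory() as base:
    for name in ("20200202", "20200109", "20190308", "Apples", "Oranges"):
        os.mkdir(os.path.join(base, name))
    open(os.path.join(base, "New Document.txt"), "w").close()

    newest = find_newest_numeric_folder(base)
    newest_name = os.path.basename(newest)
    print(newest_name)  # 20200202
```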

Christopher Hoffman · Answer 2 · 2020-02-02T04:10:38.567

Try matching the directory names with a regular expression. num = r"[0-9]+" is your regular expression. Something like re.findall(num, subdirname) returns a list of the substrings that consist of one or more digits.
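A minimal sketch of the regex approach, with one caveat: re.findall matches digits anywhere in the name, so a mixed name would still produce a hit. Using fullmatch on the same pattern accepts only names that are digits from start to end. The list of names is hard-coded here as an assumption (in practice it would come from os.walk or os.scandir), and "2020_backup" is an invented entry added to show the difference.

```python
import re

NUMERIC = re.compile(r"[0-9]+")

# Hypothetical directory names; "2020_backup" is not in the original example.
names = ["20200202", "20200109", "20190308", "Apples", "Oranges", "2020_backup"]

# findall returns every run of digits, even inside a mixed name.
partial_hits = NUMERIC.findall("2020_backup")
print(partial_hits)  # ['2020']

# fullmatch succeeds only when the whole name is digits.
numeric_only = [n for n in names if NUMERIC.fullmatch(n)]

# Compare by integer value to pick the largest.
newest = max(numeric_only, key=int)
print(newest)  # 20200202
```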

edited Feb 02 '20 at 04:10

answered Feb 02 '20 at 03:54

Thanks - I'm not great with regex, so I stepped it out a little further in the answer I supplied using an alternate method. – anakaine Feb 02 '20 at 04:43
