What is the fastest way to get list of filenames created in certain time period

Question

I have a huge directory with more than a thousand files in it.

I want to get the files created from 7:30 PM to 7:30 AM, and likewise from 7:30 AM to 7:30 PM.

I have been using the code below, but it seems to get slower as the number of files grows. I am running it on Linux.

First, I defined a get_time function:

from datetime import datetime, timedelta

def get_time():
    # Return the start and end of the shift that contains the current time.
    tmp_date = datetime.now()
    year = tmp_date.year
    month = tmp_date.month
    day = tmp_date.day
    date_start = datetime(year, month, day, 7, 30)
    date_end = datetime(year, month, day, 19, 30)
    shift = "Day Shift"

    if (date_start < tmp_date) and (tmp_date > date_end):
        # After 7:30 PM: the night shift runs until tomorrow morning.
        date_start = datetime(year, month, day, 19, 30)
        date_end = datetime(year, month, day, 7, 30) + timedelta(1)
        shift = "Night Shift"

    elif (date_start > tmp_date) and (tmp_date < date_end):
        # Before 7:30 AM: the night shift started yesterday evening.
        date_start = datetime(year, month, day, 19, 30) - timedelta(1)
        date_end = datetime(year, month, day, 7, 30)
        shift = "Night Shift"

    return date_start, date_end, shift

and then

import os

def get_qc_success(ROOT_FOLDER):

    date_start, date_end, shift = get_time()

    # Start from an empty list so only matching files are returned.
    files = []

    ARCHIVE_FOLDER = os.path.join(ROOT_FOLDER, "LOMS", "ARCHIVE")
    for csv in os.listdir(ARCHIVE_FOLDER):
        path = os.path.join(ARCHIVE_FOLDER, csv)
        # Note: on Linux, getctime returns the time of the last metadata
        # change, not the creation time.
        filetime = datetime.fromtimestamp(os.path.getctime(path))
        if date_start < filetime < date_end:
            files.append(csv)
    len_success = len(files)

    return files, len_success, shift

Is there any other method that would make it faster?

Ice Bear · Answer 1 · 2020-12-19T06:20:57.097

Instead of building a list and returning it, you can yield each matching file.

def get_qc_success(ROOT_FOLDER, date_start, date_end):
    # date_start and date_end are passed in by the caller, so get_time()
    # is not called here, and the files list is no longer needed.
    ARCHIVE_FOLDER = os.path.join(ROOT_FOLDER, "LOMS", "ARCHIVE")
    for csv in os.listdir(ARCHIVE_FOLDER):
        path = os.path.join(ARCHIVE_FOLDER, csv)
        filetime = datetime.fromtimestamp(os.path.getctime(path))
        if date_start < filetime < date_end:
            yield csv

You still need len_success, right? That is just the number of matching files, and you can compute it from the generator.

For me this is the best way. Why? Check here.

date_start, date_end, shift = get_time()
# The generator produces the values yielded by get_qc_success (the csv
# names); you just need to iterate over it.
generator = get_qc_success("filepathsample/", date_start, date_end)
len_files = sum(1 for x in generator)
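
One caveat with counting this way: a generator can only be consumed once, so after the sum(...) call the file names themselves are gone. If you need both the names and the count, a minimal sketch is to materialize the generator into a list once. The matching_names function and the sample names below are hypothetical stand-ins for get_qc_success and its results, not the original code.

```python
def matching_names(names, suffix):
    """Hypothetical stand-in for get_qc_success: yield names one at a time."""
    for name in names:
        if name.endswith(suffix):
            yield name


generator = matching_names(["a.csv", "b.txt", "c.csv"], ".csv")

# Materialize once: the list keeps the names, and len() gives the count.
files = list(generator)
len_files = len(files)

# The generator is now exhausted, so iterating it again produces nothing.
leftover = sum(1 for _ in generator)
```

With a list in hand you can return files and len_files together, as the original function did.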

As for your get_time() function, I think it would be better to call it once at the top and just pass the date_start and date_end values in. As far as I can see, in general you only want the list of files, and you are appending them to a list and returning that list. There is a better way to do that: the yield keyword.
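
Since the question is about speed, here is a hedged sketch of the same idea built on os.scandir, with the time window passed in as parameters. The function name iter_files_in_window is made up for illustration. On Linux, each entry.stat() call still costs one system call per file, so whether this is actually faster on your directory is an assumption you should measure; what it does avoid is building a full path and a datetime object for every file. Like os.path.getctime, st_ctime on Linux is the last metadata change time rather than a true creation time.

```python
import os
from datetime import datetime


def iter_files_in_window(folder, date_start, date_end):
    """Yield names of regular files whose ctime falls inside the window.

    Hypothetical helper; date_start and date_end are naive local datetimes,
    as returned by get_time() in the question.
    """
    # Convert the window to raw timestamps once, outside the loop.
    start_ts = date_start.timestamp()
    end_ts = date_end.timestamp()
    with os.scandir(folder) as entries:
        for entry in entries:
            # Skip subdirectories; compare timestamps as plain floats.
            if entry.is_file() and start_ts < entry.stat().st_ctime < end_ts:
                yield entry.name
```

Calling list(iter_files_in_window(ARCHIVE_FOLDER, date_start, date_end)) would then give the names for one shift.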

For more on yield, see this other question:

Return vs Yield

Source here

If you are still curious about yield versus return, run the example below and observe the output and the flow of the program. You'll see the benefit you get from yield.

import time

def foo():
    data = []

    for i in range(10):
        data.append(i)
        print("Sleeping")
        time.sleep(2)

    return data


def foofoo():
    for i in range(10):
        yield i
        print("Sleeping")
        time.sleep(2)


# Run this first and observe the output and the flow of the program.
for f in foo():
    print(f)


'''
# Run this second, after commenting out the foo() loop above, and observe the output and flow.
for f in foofoo():
    print(f)
'''
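
If you would rather not wait on time.sleep, the same difference can be seen by recording the order of events. This is a small sketch with made-up names (eager, lazy, run). Note that both versions do the same total amount of work; the generator changes when each result becomes available, not how long the whole loop takes.

```python
def eager(log):
    """Build the whole list before handing anything back."""
    data = []
    for i in range(3):
        log.append(f"produce {i}")
        data.append(i)
    return data


def lazy(log):
    """Hand back each item as soon as it is produced."""
    for i in range(3):
        log.append(f"produce {i}")
        yield i


def run(producer):
    """Consume a producer and return the combined event log."""
    log = []
    for item in producer(log):
        log.append(f"consume {item}")
    return log


eager_log = run(eager)  # all "produce" events come before any "consume"
lazy_log = run(lazy)    # "produce" and "consume" events alternate
print(eager_log)
print(lazy_log)
```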
