I have a huge directory with more than a thousand files in it.

I want to get the files created between 7:30 PM and 7:30 AM (night shift), or between 7:30 AM and 7:30 PM (day shift).

I have been using the code below, but it seems to get slower as the number of files increases. I am running it on Linux.

First, I defined a get_time function:

import os
from datetime import datetime, timedelta

def get_time():

    tmp_date = datetime.now()
    year = tmp_date.year
    month = tmp_date.month
    day = tmp_date.day  
    date_start = datetime(year, month, day, 7,30)
    date_end = datetime(year, month, day, 19,30)
    shift = "Day Shift"

    if (date_start < tmp_date) and (tmp_date > date_end):
        date_start = datetime(year, month, day, 19,30)
        date_end = datetime(year, month, day, 7,30) + timedelta(1)
        shift = "Night Shift"
        
    elif (date_start > tmp_date) and (tmp_date < date_end):
        date_start = datetime(year, month, day, 19,30) - timedelta(1)
        date_end = datetime(year, month, day, 7,30)
        shift = "Night Shift"
    
    return date_start, date_end, shift

and then

def get_qc_success(ROOT_FOLDER):

    date_start, date_end, shift = get_time()
    
    files = []

    ARCHIVE_FOLDER = os.path.join(ROOT_FOLDER,"LOMS","ARCHIVE")
    # Start from the empty list so that only matching files are counted.
    for csv in os.listdir(ARCHIVE_FOLDER):
        path = os.path.join(ARCHIVE_FOLDER,csv)
        filetime = datetime.fromtimestamp(
                os.path.getctime(path))
        if (date_start < filetime < date_end):
            files.append(csv)
    len_success = len(files)
            
    return files, len_success, shift

Are there any other methods to make it faster?

Rahman Haroon
Royal

1 Answer

Instead of building a list and returning it, you can yield each matching file from the function.

def get_qc_success(ROOT_FOLDER, date_start, date_end):
    # date_start and date_end come from the caller, so get_time() is not
    # called again here, and no intermediate files list is needed.
    ARCHIVE_FOLDER = os.path.join(ROOT_FOLDER, "LOMS", "ARCHIVE")
    for csv in os.listdir(ARCHIVE_FOLDER):
        path = os.path.join(ARCHIVE_FOLDER, csv)
        filetime = datetime.fromtimestamp(os.path.getctime(path))
        if date_start < filetime < date_end:
            yield csv

You still need len_success, which is the number of matching files. You can compute it from the generator as well.

For me this is the best way. Here is how it would be called:

date_start, date_end, shift = get_time()
# The generator yields each matching file name (the csv variable in the function);
# you just need to iterate over it.
generator = get_qc_success("filepathsample/", date_start, date_end)
# Counting this way consumes the generator; call list(generator) instead if you also need the names.
len_files = sum(1 for x in generator)
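One thing to watch for when counting this way: a generator can only be iterated once. The following is a minimal sketch on a throwaway directory, assuming you need both the names and the count. The function name files_in_window and the file names are hypothetical, and on Linux os.path.getctime reports the last inode change time rather than a true creation time.

```python
import os
import tempfile
from datetime import datetime, timedelta

def files_in_window(folder, date_start, date_end):
    """Yield names in folder whose ctime falls strictly inside the window."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # On Linux this is the inode change time, not the creation time.
        ctime = datetime.fromtimestamp(os.path.getctime(path))
        if date_start < ctime < date_end:
            yield name

# Demo on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.csv", "b.csv", "c.csv"):
        open(os.path.join(folder, name), "w").close()

    now = datetime.now()
    start, end = now - timedelta(hours=1), now + timedelta(hours=1)

    generator = files_in_window(folder, start, end)
    count_only = sum(1 for _ in generator)  # consumes the generator
    leftover = list(generator)              # already exhausted, so empty

    # Materialize once when both the names and the count are needed.
    files = list(files_in_window(folder, start, end))
    len_success = len(files)
```

If only the count matters, the sum(...) form avoids holding every name in memory; if the names are needed afterwards, building the list once and taking len() of it is simpler.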

As for your get_time() function, I think it is better to call it once at the top level and pass date_start and date_end into get_qc_success. In general, you only want the list of matching files, and you are currently appending them to a list and returning that list. The yield keyword lets you hand each file back as soon as it is found.
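To make that easier, the shift calculation can take the current time as a parameter, which also makes it simple to check against fixed times. This is only a sketch: shift_window is a hypothetical name, and it assumes the same boundaries as get_time() (day shift from 07:30 to 19:30, with the exact boundary instants counted as day shift).

```python
from datetime import datetime, timedelta

def shift_window(now):
    """Return (start, end, shift) for the shift that contains `now`."""
    day_start = now.replace(hour=7, minute=30, second=0, microsecond=0)
    day_end = now.replace(hour=19, minute=30, second=0, microsecond=0)

    if now < day_start:
        # Early morning: the night shift began at 19:30 yesterday.
        return day_end - timedelta(days=1), day_start, "Night Shift"
    if now > day_end:
        # Evening: the night shift runs until 07:30 tomorrow.
        return day_end, day_start + timedelta(days=1), "Night Shift"
    return day_start, day_end, "Day Shift"

# Compute the window once, then pass it to whatever scans the directory.
date_start, date_end, shift = shift_window(datetime.now())
```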

For more on the difference, see the related question "Return vs Yield".

As written, your get_qc_success function uses return, so the caller gets nothing back until every file in the directory has been checked.

If you are still curious about yield versus return, run the example below and observe the output and the flow of the program. You will see the benefits of using yield.

import time

def foo():
    data = []
    for i in range(10):
        data.append(i)
        print("Sleeping")
        time.sleep(2)
    return data


def foofoo():
    for i in range(10):
        yield i
        print("Sleeping")
        time.sleep(2)


# Run this first and observe the output and flow of the program.
for f in foo():
    print(f)

'''
# Run this second (comment out the foo() loop above) and observe the output and flow.
for f in foofoo():
    print(f)
'''
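The same contrast can be seen without the two-second sleeps by recording the order of events in a list. This is a small sketch; eager, lazy, and the log lists are hypothetical names used only for illustration.

```python
def eager(n, log):
    """Build the whole list before returning anything."""
    data = []
    for i in range(n):
        log.append(f"produce {i}")
        data.append(i)
    return data

def lazy(n, log):
    """Hand each item to the caller as soon as it is produced."""
    for i in range(n):
        log.append(f"produce {i}")
        yield i

eager_log = []
for item in eager(3, eager_log):
    eager_log.append(f"consume {item}")

lazy_log = []
for item in lazy(3, lazy_log):
    lazy_log.append(f"consume {item}")

print(eager_log)  # all "produce" entries come before any "consume" entry
print(lazy_log)   # "produce" and "consume" entries alternate
```

With return, all the work finishes before the caller sees the first item; with yield, the caller can start processing (or stop early) while the rest is still pending. The total amount of work is the same if every item is consumed.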
Ice Bear