
I am currently successfully using a Python 2.7 script that recursively walks a huge directory tree, collects the paths of all files, and gets the mtime of each file as well as the mtime of the corresponding file with the same path and name but a pdf extension, for comparison. I use scandir.walk() in the Python 2.7 script and os.walk() in Python 3.7, which has been updated to also use the scandir algorithm internally (no additional stat() calls).

However, the Python 3 version of the script is still significantly slower! This is not due to the scandir/walk part of the algorithm, but apparently either to the getmtime calls (which, however, are the same in Python 2 and 3) or to the processing of the huge list (we are talking about ~500,000 entries in this list).

Any idea what might cause this and how to solve this issue?

#!/usr/bin/env python3
#
# Imports
#
import sys
import time
from datetime import datetime
import os
import re

#
# MAIN THREAD
#

if __name__ == '__main__':

    source_dir = '/path_to_data/'
    ignore_subdirs = []  # path substrings excluded from the database check

    # Get file list
    files_list = []
    for root, directories, filenames in os.walk(source_dir):
        # Filter for extension
        for filename in filenames:
            if (filename.lower().endswith(('.msg', '.doc', '.docx', '.xls', '.xlsx'))) and (not filename.lower().startswith('~')):
                files_list.append(os.path.join(root, filename))

    # Sort list
    files_list.sort(reverse=True)

    # For each file, the printing routine is performed (including necessity check)
    all_documents_counter = len(files_list)
    for docfile_abs in files_list:

        print('\n' + docfile_abs)

        # Define files
        filepathname_abs, file_extension = os.path.splitext(docfile_abs)
        filepath_abs, filename = os.path.split(filepathname_abs)

        # If the filename does not have the format # # # # # # # *.xxx (i.e. seven digits), it is checked whether it is referenced in the database. If not, it is moved to a certain directory
        if (re.match(r'[0-9][0-9][0-9][0-9][0-9][0-9][0-9](([Aa][0-9][0-9]?)?|(_[0-9][0-9]?)?|([Aa][0-9][0-9]?_[0-9][0-9]?)?)\...?.?', filename + file_extension) is None):
            if any(expression in docfile_abs for expression in ignore_subdirs):
                pass
            else:
                print('Not in database')

        # DOC
        docfile_rel = docfile_abs.replace(source_dir, '')

        # Check pdf
        try:
            pdf_file_abs = filepathname_abs + '.pdf'
            pdf_file_timestamp = os.path.getmtime(pdf_file_abs)
            check_pdf = True
        except FileNotFoundError:
            check_pdf = False
        # Check PDF
        try:
            PDF_file_abs = filepathname_abs + '.PDF'
            PDF_file_timestamp = os.path.getmtime(PDF_file_abs)
            check_PDF = True
        except FileNotFoundError:
            check_PDF = False

        # Check whether there are lowercase and/or uppercase extensions and decide what to do if none, just one, or both are present
        if (check_pdf is True) and (check_PDF is False):
            # Lower case case
            pdf_extension = '.pdf'
            pdffile_timestamp = pdf_file_timestamp
        elif (check_pdf is False) and (check_PDF is True):
            # Upper case case
            pdf_extension = '.PDF'
            pdffile_timestamp = PDF_file_timestamp
        elif (check_pdf is False) and (check_PDF is False):
            # None -> set timestamp to zero
            pdf_extension = '.pdf'
            pdffile_timestamp = 0
        elif (check_pdf is True) and (check_PDF is True):
            # Both are present, decide for the newest and move the other to a directory
            if (pdf_file_timestamp < PDF_file_timestamp):
                pdf_extension = '.PDF'
                pdf_file_rel = pdf_file_abs.replace(source_dir, '')
                pdffile_timestamp = PDF_file_timestamp
            elif (PDF_file_timestamp < pdf_file_timestamp):
                pdf_extension = '.pdf'
                PDF_file_rel = PDF_file_abs.replace(source_dir, '')
                pdffile_timestamp = pdf_file_timestamp

        # Get timestamps of doc and pdf files
        try:
            docfile_timestamp = os.path.getmtime(docfile_abs)
        except OSError:
            docfile_timestamp = 0

        # Enable this to force a certain period to be printed
        DateBegin = time.mktime(time.strptime('01/02/2017', "%d/%m/%Y"))
        DateEnd = time.mktime(time.strptime('01/03/2017', "%d/%m/%Y"))

        # Compare timestamps and print or not
        if (pdffile_timestamp < docfile_timestamp) or (pdffile_timestamp == 0):

            # Inform that there should be printed
            print('\tPDF should be printed.')

        else:
            # Inform that there was no need to print
            print('\tPDF is up to date.')


    # Exit
    sys.exit(0)
marlemion
  • are you using windows? – Jean-François Fabre Feb 22 '19 at 14:26
  • you need a recursive scandir: https://stackoverflow.com/questions/33135038/how-do-i-use-os-scandir-to-return-direntry-objects-recursively-on-a-directory – Jean-François Fabre Feb 22 '19 at 14:28
  • One major difference in Python 3 is that it tends to return generators where the equivalent Python 2 function would return lists. This is typically a good thing for performance, but perhaps your code is inadvertently misusing the generators by assuming they're lists. A code sample would be useful. – Nikolas Stevenson-Molnar Feb 22 '19 at 14:53
  • Thanks for the comments. I have added a stripped-down version of my script. I am on Linux, though. – marlemion Feb 25 '19 at 08:36

1 Answer


Not sure what explains the difference, but even if os.walk has been enhanced to use scandir, that enhancement doesn't extend to the subsequent getmtime calls, which access the file attributes one more time.

The ultimate goal is to not call os.path.getmtime at all.

The speedup in os.walk comes from not performing stat twice to learn whether an object is a directory or a file. But the internal DirEntry object (which scandir yields) is never made public, so you cannot reuse it to check the file time.

If you don't need recursion, os.scandir does it directly:

for dir_entry in os.scandir(r"D:\some_path"):
    print(dir_entry.is_dir())  # test for directory
    print(dir_entry.stat())    # returns stat object with date and all

Those calls within the loop come at almost zero cost, because the DirEntry object already caches this information.

So to save the getmtime call you have to get hold of DirEntry objects recursively.

There's no native method for this, but there are recipes, for example here: How do I use os.scandir() to return DirEntry objects recursively on a directory tree?
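A minimal recursive generator along the lines of that recipe (the name `scantree` is illustrative) could look like this:

```python
import os

def scantree(path):
    """Recursively yield DirEntry objects for every file under path."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            # Descend into subdirectories instead of yielding them
            yield from scantree(entry.path)
        else:
            yield entry
```

Each yielded entry carries its cached stat information, so `entry.stat().st_mtime` does not trigger an extra stat call per file on most platforms.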

By doing this, your code will be faster in both Python 2 and Python 3, because there will be only one stat call per object instead of two.

EDIT: after your edit showing the code, it appears that you're building pdf names from other entries, so you cannot rely on the DirEntry structure to get the time, and you're not even sure that the file exists (also, if you were on Windows there would be no need to test both pdf and PDF, since filenames aren't case sensitive there).

The best strategy would be to build a big database of files with their associated times (using a dictionary), then scan it. I've successfully used this method to find old/big files on a slow, networked drive holding 35 million files. In my case I scanned the files once and dumped the result into a big csv file (which took several hours and produced 6 GB of csv data); further post-processing then loaded that database and performed various tasks, which was much faster since no disk accesses were involved.
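A sketch of that strategy (the function name `build_file_db` is illustrative): one pass over the tree fills a dictionary keyed by path, and the later pdf check becomes a dict lookup instead of a getmtime call.

```python
import os

def build_file_db(root):
    """Scan the tree once and map every file path to its mtime.

    Later lookups ("does base + '.pdf' exist, and how old is it?")
    become dictionary accesses instead of extra stat() calls.
    """
    db = {}
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            else:
                db[entry.path] = entry.stat(follow_symlinks=False).st_mtime
    return db

# Example lookup: mtime of the matching pdf, or 0 if it does not exist:
#   pdf_time = db.get(base_path + '.pdf', 0)
```

With ~500,000 files this keeps all timestamp comparisons in memory after a single pass over the disk.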

Jean-François Fabre
  • Ah, I was thinking already into this direction and also did some code optimization to reduce the os calls. However, it seems that the list processing gets slower. Nevertheless, I will implement the recursive (that is what I need) scandir function. Thanks a lot! – marlemion Feb 25 '19 at 08:37
  • 1
    you also have to be careful not to use `in` with lists. It's not shown here, but it's an easy mistake to make, and with big lists, it's slow as hell. Using a `set` is recommended to test belongship. – Jean-François Fabre Feb 25 '19 at 08:39
  • As I would like to have the files in the list to be processed in ordered mode, I cannot use a set, right? It is unordered. Furthermore, the only processing I am doing with the list is to consecutively process each item. Hence, I do not see the origin for the slowness in the list. – marlemion Feb 25 '19 at 09:29
  • see my edit. I saw you added the code. I can only advise you to rethink your strategy completely – Jean-François Fabre Feb 25 '19 at 10:08
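The `in`-with-lists caveat from the comments, sketched (the paths are made up for illustration):

```python
# `x in list` scans linearly (O(n)); `x in set` hashes (O(1) average).
# With ~500,000 entries the per-lookup difference is dramatic.
paths_list = ['/data/file%d.doc' % i for i in range(500)]
paths_set = set(paths_list)

assert '/data/file499.doc' in paths_list  # walks the whole list
assert '/data/file499.doc' in paths_set   # single hash lookup
```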