
Original Question:

I am trying to recursively scan directories to get the occupied size of a disk along with details for each file and folder, using the code below. The code works fine, but I need advice on improving its efficiency so it can scan drives with 200 GB or more of occupied space. The test results for a disk with 5.49 GB of occupied space (244,169 files and 34,253 folders) are as follows:

  1. If the code is run without the list append operation, it takes roughly 8 minutes to scan the disk, which is not very efficient.
  2. It gets even worse if I include the list append statement; then it takes roughly 25 minutes --> bottleneck.
import os
import logging
from datetime import datetime

def scanSlot(path):
    """Return total size of files in given path and subdirs."""
    global path_list
    try:
        dir_list = os.scandir(path)
    except OSError:
        logging.info(">>> Access Denied for " + path)
        return 0

    tot_size = 0
    for file in dir_list:
        file_stat = file.stat()
        time_stat = os.stat(file.path)

        # If the entry is a directory, recursively call the function again
        if file.is_dir(follow_symlinks=False):

            # Recursive call
            sub_size = scanSlot(file.path)

            # Accumulate the subtree size
            tot_size += sub_size

            # logging.info('Dirname:'+str(file)+' Path'+str(file.path)+' Size'+str(sub_size))

            # List append (sub_size, not the running tot_size, is this directory's total)
            path_list.append((file.name, file.path, file_stat.st_mode, file_stat.st_ino, file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid, file_stat.st_gid, sub_size,
                              datetime.fromtimestamp(time_stat.st_atime), datetime.fromtimestamp(time_stat.st_mtime), datetime.fromtimestamp(time_stat.st_ctime), "dir"))

        # If the entry is a regular file, retrieve all the details
        if file.is_file(follow_symlinks=False):

            # logging.info('Filename:'+str(file)+' Path'+str(file.path)+' Size'+str(time_stat))

            tot_size += file_stat.st_size

            # List append
            path_list.append((file.name, file.path, file_stat.st_mode, file_stat.st_ino, file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid, file_stat.st_gid, file_stat.st_size,
                              datetime.fromtimestamp(time_stat.st_atime), datetime.fromtimestamp(time_stat.st_mtime), datetime.fromtimestamp(time_stat.st_ctime), "file"))
    return tot_size

Function call for the above code:

server_size = scanSlot('D:\\New folder')
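
One thing worth noting about the code above: each entry was originally stat-ed up to three times per iteration (file.stat(), os.stat(file.path), and file.stat(follow_symlinks=False)), even though DirEntry.stat() caches its result and already carries the st_atime/st_mtime/st_ctime fields that the separate os.stat() call fetches again. Below is a minimal single-stat sketch, not the original implementation; the function name and the explicit path_list parameter are placeholders:

import os
import logging
from datetime import datetime

def scan_slot_one_stat(path, path_list):
    """Sketch of scanSlot with one (cached) stat call per entry."""
    tot_size = 0
    try:
        entries = os.scandir(path)
    except OSError:
        logging.info(">>> Access Denied for " + path)
        return 0
    with entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)  # cached; reused for everything below
            if entry.is_dir(follow_symlinks=False):
                sub_size = scan_slot_one_stat(entry.path, path_list)
                tot_size += sub_size
                size, kind = sub_size, "dir"
            elif entry.is_file(follow_symlinks=False):
                tot_size += st.st_size
                size, kind = st.st_size, "file"
            else:
                continue  # symlinks and other special entries are skipped
            path_list.append((entry.name, entry.path, st.st_mode, st.st_ino,
                              st.st_dev, st.st_nlink, st.st_uid, st.st_gid, size,
                              datetime.fromtimestamp(st.st_atime),
                              datetime.fromtimestamp(st.st_mtime),
                              datetime.fromtimestamp(st.st_ctime), kind))
    return tot_size

Passing path_list in explicitly also removes the need for the global.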

I have tried to optimize the code using the following methods:

  1. The Python library Numba; this doesn't work, since Numba has no implementation for the os module used here.
  2. Converting the code to Cython, though I am not sure that would help.

The list append operation cannot be dropped, as the details in path_list are required for further analysis (one way to keep them without a giant in-memory list is sketched below).
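
If holding millions of tuples in memory turns out to be part of the cost, one alternative (a sketch only; the output filename is an assumption, not from the original code) is to stream each row to disk as it is produced, e.g. with csv.writer, and run the further analysis against the file:

import csv
import os
from datetime import datetime

def scan_to_csv(root, out_path="scan_rows.csv"):
    """Sketch: write one CSV row per file instead of appending to path_list."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "path", "mode", "ino", "dev", "nlink",
                         "uid", "gid", "size", "atime", "mtime", "ctime", "kind"])
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                fp = os.path.join(dirpath, name)
                if os.path.islink(fp):
                    continue  # skip symbolic links
                try:
                    st = os.stat(fp)
                except OSError:
                    continue  # file vanished or access denied
                writer.writerow((name, fp, st.st_mode, st.st_ino, st.st_dev,
                                 st.st_nlink, st.st_uid, st.st_gid, st.st_size,
                                 datetime.fromtimestamp(st.st_atime),
                                 datetime.fromtimestamp(st.st_mtime),
                                 datetime.fromtimestamp(st.st_ctime), "file"))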

Updates:

As per the suggestion from @triplee, and with the help of the implementation here, I have implemented the directory scan using os.walk(), and it is clearly much faster (19.6 GB with 275,559 files and 38,592 folders scanned in 20 minutes, while doing an I/O to a log file for each file and directory). The code is as follows:

FYI: STILL TESTING THIS

def scanSlot(path):
    total_size = 0
    global path_list
    for dirpath, dirnames, filenames in os.walk(path):

        for f in filenames:
            fp = os.path.join(dirpath, f)
            # Skip if it is a symbolic link
            if not os.path.islink(fp):
                # logging.info('Filename:'+str(fp)+'Size:'+str(os.path.getsize(fp)))
                file_stat = os.stat(fp)
                # Reuse file_stat.st_size rather than calling os.path.getsize(fp)
                # again (getsize is just another os.stat under the hood)
                path_list.append((f, fp, file_stat.st_mode, file_stat.st_ino, file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid, file_stat.st_gid, file_stat.st_size,
                                  datetime.fromtimestamp(file_stat.st_atime), datetime.fromtimestamp(file_stat.st_mtime), datetime.fromtimestamp(file_stat.st_ctime), "file"))
                total_size += file_stat.st_size

    return total_size

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)  # needed so logging.info messages are emitted
    path_list = []
    logging.info('>>> Start:' + str(datetime.now().time()))

    # Run os.walk() for the drive, then call scanSlot for every directory in it

    for dirpath, dirnames, filenames in os.walk('F:\\'):

        for f in dirnames:
            fp = os.path.join(dirpath, f)
            dir_size = scanSlot(fp)
            file_stat = os.stat(fp)
            logging.info('Dirname:' + str(fp) + 'Size:' + str(dir_size))
            # st_gid was missing from this tuple, leaving "dir" rows one field short
            path_list.append((f, fp, file_stat.st_mode, file_stat.st_ino, file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid, file_stat.st_gid, dir_size,
                              datetime.fromtimestamp(file_stat.st_atime), datetime.fromtimestamp(file_stat.st_mtime), datetime.fromtimestamp(file_stat.st_ctime), "dir"))

    logging.info('>>> End:' + str(datetime.now().time()))
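
One caveat about the driver loop above: because scanSlot(fp) is called for every directory that os.walk('F:\\') yields, each subtree is re-walked once per ancestor, so deep trees get scanned many times over. Below is a sketch (not the original code) of getting every directory's total in a single pass, walking bottom-up so each child's total is already known when its parent is summed:

import os

def dir_sizes(root):
    """Sketch: one bottom-up os.walk pass; each directory is visited once."""
    sizes = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):
                try:
                    total += os.path.getsize(fp)
                except OSError:
                    pass  # file vanished or access denied
        for d in dirnames:
            # Children were visited first (topdown=False), so their totals exist
            total += sizes.get(os.path.join(dirpath, d), 0)
        sizes[dirpath] = total
    return sizes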

Further Questions:

  • How can I make this even more efficient/faster? Initially I had thought of multiprocessing, i.e. running the scans for 3 drives in parallel under 3 different processes (see the sketch below).
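
A sketch of that multiprocessing idea; the drive letters and the per-drive scan function are placeholders, not from the original code. Since the walks for separate drives are independent, concurrent.futures.ProcessPoolExecutor can run one per process:

import os
from concurrent.futures import ProcessPoolExecutor

def drive_size(root):
    """Placeholder per-drive scan: total size of regular files under root."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):
                try:
                    total += os.path.getsize(fp)
                except OSError:
                    pass  # file vanished or access denied
    return total

if __name__ == "__main__":
    drives = ["D:\\", "E:\\", "F:\\"]  # assumed drive letters
    with ProcessPoolExecutor(max_workers=len(drives)) as pool:
        for drive, size in zip(drives, pool.map(drive_size, drives)):
            print(drive, size)

Since the work is dominated by disk I/O rather than CPU, a ThreadPoolExecutor may perform comparably, and three drives give at most a 3x win either way.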

References:

The explanation for follow_symlinks can also be found in the reference links above.
