Original Question:
I am trying to recursively scan directories to get the occupied size of a disk, plus details for each file and folder, using the code below. The code works fine, but I need advice on making it efficient enough to scan drives with 200 GB or more of occupied space. The test results for a disk with 5.49 GB occupied (244,169 files and 34,253 folders) are as follows:
- Without the list append operation, the code takes roughly 8 minutes to scan the disk, which is already slow.
- With the list append statement included, it takes roughly 25 minutes --> bottleneck.
import os
import logging
from datetime import datetime

def scanSlot(path):
    """Return total size of files in the given path and its subdirectories."""
    global path_list
    try:
        dir_list = os.scandir(path)
    except OSError:
        logging.info(">>> Access Denied for " + path)
        dir_list = []
    tot_size = 0
    for file in dir_list:
        file_stat = file.stat()
        time_stat = os.stat(file.path)
        # If the entry is a directory, recurse into it
        if file.is_dir(follow_symlinks=False):
            # Recursive call
            sub_size = scanSlot(file.path)
            # Calculating size
            tot_size += sub_size
            # List append (sub_size, not tot_size: the running total would
            # also include previously scanned siblings)
            path_list.append((file.name, file.path, file_stat.st_mode, file_stat.st_ino,
                              file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                              file_stat.st_gid, sub_size,
                              datetime.fromtimestamp(time_stat.st_atime),
                              datetime.fromtimestamp(time_stat.st_mtime),
                              datetime.fromtimestamp(time_stat.st_ctime), "dir"))
        # If the entry is a regular file, retrieve all its details
        if file.is_file(follow_symlinks=False):
            tot_size += file.stat(follow_symlinks=False).st_size
            # List append
            path_list.append((file.name, file.path, file_stat.st_mode, file_stat.st_ino,
                              file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                              file_stat.st_gid, file_stat.st_size,
                              datetime.fromtimestamp(time_stat.st_atime),
                              datetime.fromtimestamp(time_stat.st_mtime),
                              datetime.fromtimestamp(time_stat.st_ctime), "file"))
    return tot_size
The function is called like this (path_list must be initialized first):
path_list = []
server_size = scanSlot('D:\\New folder')
I have tried to optimize the code using the following methods:
- The numba library: this doesn't work, since numba has no implementation for the os module used here.
- Converting the code to Cython, though I am not sure whether that would help.
The list append operation cannot be dropped, as the details collected in path_list are required for further analysis.
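One observation about the os.scandir() version: per the Python docs, os.DirEntry.stat() caches its result on the entry object (and on Windows it usually needs no extra system call at all), yet the code above stats each entry up to three times (file.stat(), os.stat(file.path), and file.stat(follow_symlinks=False)). A minimal sketch of a single-stat variant; scan_slot_once is a hypothetical name, and it takes all fields, including timestamps, from the one lstat-style result:

import os
import logging
from datetime import datetime

path_list = []

def scan_slot_once(path):
    """Variant of scanSlot() that performs one stat per entry."""
    tot_size = 0
    try:
        entries = os.scandir(path)
    except OSError:
        logging.info(">>> Access Denied for " + path)
        return 0
    with entries:
        for entry in entries:
            try:
                # Cached on the DirEntry; on Windows this usually needs
                # no additional system call.
                st = entry.stat(follow_symlinks=False)
            except OSError:
                continue
            if entry.is_dir(follow_symlinks=False):
                size, kind = scan_slot_once(entry.path), "dir"
            elif entry.is_file(follow_symlinks=False):
                size, kind = st.st_size, "file"
            else:
                continue  # ignore symlinks and special files
            tot_size += size
            path_list.append((entry.name, entry.path, st.st_mode, st.st_ino,
                              st.st_dev, st.st_nlink, st.st_uid, st.st_gid, size,
                              datetime.fromtimestamp(st.st_atime),
                              datetime.fromtimestamp(st.st_mtime),
                              datetime.fromtimestamp(st.st_ctime), kind))
    return tot_size

Appending to a Python list is amortized O(1), so the jump from 8 to 25 minutes is more likely dominated by the extra os.stat() and datetime.fromtimestamp() calls made on the append path than by the append itself.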
Updates:
As per the suggestion from @triplee, and with the help of the implementation here, I have implemented the directory scan using os.walk(), and it is clearly much faster (19.6 GB with 275,559 files and 38,592 folders scanned in 20 minutes, while writing a log entry for each file and directory). The code is as follows:
FYI: STILL TESTING THIS
def scanSlot(path):
    """Return total size of files under the given path using os.walk()."""
    global path_list
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # Skip if it is a symbolic link
            if not os.path.islink(fp):
                file_stat = os.stat(fp)
                path_list.append((f, fp, file_stat.st_mode, file_stat.st_ino,
                                  file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                                  file_stat.st_gid, file_stat.st_size,
                                  datetime.fromtimestamp(file_stat.st_atime),
                                  datetime.fromtimestamp(file_stat.st_mtime),
                                  datetime.fromtimestamp(file_stat.st_ctime), "file"))
                # Reuse the stat result instead of calling os.path.getsize() again
                total_size += file_stat.st_size
    return total_size

if __name__ == "__main__":
    path_list = []
    logging.info('>>> Start:' + str(datetime.now().time()))
    # Run os.walk() over the drive, then call scanSlot() for every directory in it
    for dirpath, dirnames, filenames in os.walk('F:\\'):
        for f in dirnames:
            fp = os.path.join(dirpath, f)
            dir_size = scanSlot(fp)
            file_stat = os.stat(fp)
            logging.info('Dirname:' + str(fp) + ' Size:' + str(dir_size))
            path_list.append((f, fp, file_stat.st_mode, file_stat.st_ino,
                              file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                              file_stat.st_gid, dir_size,
                              datetime.fromtimestamp(file_stat.st_atime),
                              datetime.fromtimestamp(file_stat.st_mtime),
                              datetime.fromtimestamp(file_stat.st_ctime), "dir"))
    logging.info('>>> End:' + str(datetime.now().time()))
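A caveat with the version above: the outer os.walk('F:\\') loop yields every directory at every depth, and scanSlot() then re-walks that directory's entire subtree, so each file ends up being statted once per ancestor directory. The whole job can be done in a single pass instead: with topdown=False, os.walk() yields children before their parents, so each directory's size is the sum of its own files plus the already-computed sizes of its immediate subdirectories. A sketch of that idea, untested against the drives above (scan_drive is a hypothetical name):

import os
import stat
import logging
from datetime import datetime

path_list = []

def scan_drive(root):
    """Single bottom-up pass: each file and directory is statted exactly once."""
    dir_sizes = {}  # dirpath -> total size of that subtree
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for f in filenames:
            fp = os.path.join(dirpath, f)
            try:
                st = os.stat(fp, follow_symlinks=False)
            except OSError:
                continue
            if stat.S_ISLNK(st.st_mode):
                continue  # skip symbolic links, as in the original
            total += st.st_size
            path_list.append((f, fp, st.st_mode, st.st_ino, st.st_dev,
                              st.st_nlink, st.st_uid, st.st_gid, st.st_size,
                              datetime.fromtimestamp(st.st_atime),
                              datetime.fromtimestamp(st.st_mtime),
                              datetime.fromtimestamp(st.st_ctime), "file"))
        # Immediate subdirectories were already visited (topdown=False),
        # so their totals are available and can be folded in.
        for d in dirnames:
            total += dir_sizes.pop(os.path.join(dirpath, d), 0)
        dir_sizes[dirpath] = total
        try:
            st = os.stat(dirpath)
            path_list.append((os.path.basename(dirpath), dirpath, st.st_mode,
                              st.st_ino, st.st_dev, st.st_nlink, st.st_uid,
                              st.st_gid, total,
                              datetime.fromtimestamp(st.st_atime),
                              datetime.fromtimestamp(st.st_mtime),
                              datetime.fromtimestamp(st.st_ctime), "dir"))
        except OSError:
            pass
    return dir_sizes.get(root, 0)

This visits each file once instead of once per ancestor directory, which should recover most of the time currently spent re-walking subtrees.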
Further Questions:
- How can this be made even more efficient/faster? Initially I had thought of multiprocessing, i.e. running the scans for 3 drives in parallel under 3 different processes; a sketch follows below.
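For the multiprocessing idea, the per-drive scans are independent, so a process pool maps onto them naturally. A minimal sketch using concurrent.futures, assuming the hypothetical scan_drive() and module-level path_list from the sketch above; each worker process gets its own copy of path_list, so the rows are returned to the parent rather than shared:

import logging
import concurrent.futures

def scan_worker(root):
    path_list.clear()  # fresh list in case the process is reused for another drive
    total = scan_drive(root)
    return root, total, list(path_list)

if __name__ == "__main__":
    drives = ['D:\\', 'E:\\', 'F:\\']  # hypothetical drive list
    all_rows = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=len(drives)) as pool:
        for root, total, rows in pool.map(scan_worker, drives):
            logging.info('Drive:' + root + ' Size:' + str(total))
            all_rows.extend(rows)

Note that the work is I/O-bound, so this pays off mainly when the drives are separate physical disks; three processes scanning the same disk will mostly contend for it. On Windows, scan_worker must be defined at module level so the spawned worker processes can import it.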
References:
The explanation for follow_symlinks can also be found in the reference links above.