EDIT 1: As fizzybear pointed out, it looks as though my memory usage is steadily increasing, but I can't say why; any ideas would be greatly appreciated.
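In case it helps, this is roughly how I plan to sample allocations to hunt for the culprit (a minimal sketch using the standard tracemalloc module; where exactly to take the snapshot is arbitrary):

import tracemalloc

tracemalloc.start()

# ... let the script churn through a batch of CFGs, then:
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)  # top allocation sites by total size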
I'm running a script which uses the staticfg library to generate a tonne of control flow graphs (CFGs) from Python programs, approximately 150,000 of them. My code simply loops through every program's file location and generates a corresponding control flow graph.
From a frequently updated progress bar I can see that when the script begins running it easily generates around 1,000 CFGs in a few seconds, but half an hour in it can barely manage 100 CFGs per minute.
In an attempt to speed things up I implemented multithreading using a Pool.map() from Python's multiprocessing.dummy module, but this doesn't help enough.
Furthermore, CPU utilization (across all cores) shoots up to around 80-90% at the beginning of the script but drops to around 30-40% after it has been running for a few minutes.
I've tried running it on Windows 10 and Ubuntu 18.04 and both slow down to an almost unbearable speed.
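(I've just been eyeballing utilization in the system monitor; a scripted check could look like this sketch, which uses the third-party psutil package:)

import psutil

# Log per-core utilization; interval=5 averages each reading over 5 seconds.
while True:
    print(psutil.cpu_percent(interval=5, percpu=True))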
Code for building the control flow graphs
import os

from staticfg import CFGBuilder

# get_file_paths, get_output_data_path, DATA_PATH and ERROR_LOG_FILE are defined elsewhere.

def process_set():
    content = get_file_paths()
    iterate(build_cfg, ERROR_LOG_FILE, content)

def build_cfg(file_path):
    # Parse the source file and build its CFG.
    cfg = CFGBuilder().build_from_file(os.path.basename(file_path), os.path.join(DATA_PATH, file_path))
    # Render the CFG out as a .dot file.
    cfg.build_visual(get_output_data_path(file_path), format='dot', calls=False, show=False)
    os.remove(get_output_data_path(file_path))  # graphviz's render() also leaves the bare source file behind; delete it
Code for running the CFG building
from threading import Lock
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool, despite the module name
import multiprocessing

def iterate(task, error_file_path, content):
    progress_bar = ProgressBar(0, len(content), prefix='Progress:', suffix='Complete')
    progress_bar.print_progress_bar()
    error_file_lock = Lock()
    increment_work_lock = Lock()
    increment_errors_lock = Lock()

    def an_iteration(file):
        try:
            task(file)
        except Exception as e:
            with increment_errors_lock:
                progress_bar.increment_errors()
            with error_file_lock:
                handle_exception(error_file_path, file, 'Error in doing thing', e)
        finally:
            with increment_work_lock:
                progress_bar.increment_work()
                progress_bar.print_progress_bar()

    pool = ThreadPool(multiprocessing.cpu_count())
    pool.map(an_iteration, content)
    pool.close()
    pool.join()
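Since multiprocessing.dummy is a thread pool despite its name, the Python-level CFG building presumably still contends for the GIL, so one thing I could try is a real process pool. A minimal sketch (iterate_with_processes is just an illustrative name; it assumes build_cfg and the file paths pickle cleanly, and leaves the progress bar out for brevity):

from multiprocessing import Pool, cpu_count

def iterate_with_processes(task, content):
    # Each worker runs `task` in its own interpreter, so nothing can
    # accumulate in the parent process. The per-file error handling from
    # an_iteration would need to move into task itself or into this loop.
    # On Windows this must run under an `if __name__ == '__main__':` guard.
    with Pool(cpu_count()) as pool:
        for _ in pool.imap_unordered(task, content):
            pass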
Code for error handling
import datetime

def handle_exception(error_log_file_path, file_path, message, exception):
    with open(error_log_file_path, 'a+', encoding='utf8') as f:
        f.write('{},{},{},{}\n'.format(datetime.datetime.now(), message, file_path, exception))
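(Side note: I'm passing the exception object itself as the last argument, which only logs its message. If the full stack ever matters, a variant using the standard traceback module would capture it; handle_exception_verbose is just an illustrative name, and format_exc() has to be called from inside the except block:)

import datetime
import traceback

def handle_exception_verbose(error_log_file_path, file_path, message):
    # format_exc() returns the traceback of the exception currently being handled.
    with open(error_log_file_path, 'a', encoding='utf8') as f:
        f.write('{},{},{}\n{}\n'.format(datetime.datetime.now(), message, file_path, traceback.format_exc()))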
As far as I can tell, no object ever grows in size and there is no lookup whose cost increases over time, so I'm a little lost as to why the script should be slowing down at all. Any help would be greatly appreciated.
I'm also pretty sure it's not lock contention slowing the program down: I was seeing this slowdown before I implemented multithreading, and contention should be low anyway because building each CFG takes far longer than updating the progress bar. Furthermore, errors aren't that frequent, so the error log isn't written often enough to cause much contention either.
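(If I wanted to rule contention out completely, I could time how long each acquisition blocks; a self-contained sketch of the measurement pattern, which in practice would wrap the locks inside an_iteration:)

import time
from threading import Lock

lock = Lock()

start = time.perf_counter()
with lock:
    waited = time.perf_counter() - start  # time spent blocked acquiring the lock
print('waited {:.6f}s'.format(waited))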
Cheers.
EDIT 2: Code for the progress bar, in case it affects the memory usage.
class ProgressBar:
    def __init__(self, iteration, total, prefix='', suffix='', decimals=1, length=100, fill='█'):
        self.iteration = iteration
        self.total = total
        self.prefix = prefix
        self.suffix = suffix
        self.decimals = decimals
        self.length = length
        self.fill = fill
        self.errors = 0

    def increment_work(self):
        self.iteration += 1

    def increment_errors(self):
        self.errors += 1

    def print_progress_bar(self):
        percent = ("{0:." + str(self.decimals) + "f}").format(100 * (self.iteration / float(self.total)))
        filled_length = int(self.length * self.iteration // self.total)
        bar = self.fill * filled_length + '-' * (self.length - filled_length)
        print('%s |%s| %s%% (%s/%s) %s, %s errors' % (self.prefix, bar, percent, self.iteration, self.total, self.suffix, self.errors), end='\r')
        # Print a newline on completion
        if self.iteration == self.total:
            print()
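For reference, the bar is driven exactly as in iterate above, e.g.:

bar = ProgressBar(0, 150000, prefix='Progress:', suffix='Complete')
bar.print_progress_bar()  # initial empty bar
bar.increment_work()      # after each file is processed
bar.print_progress_bar()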