
EDIT 1: As fizzybear pointed out, it looks as though my memory usage is steadily increasing, but I can't say why; any ideas would be greatly appreciated.
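
For reference, the kind of check fizzybear suggested can be done with a small background sampler like the sketch below. It isn't part of my script: it assumes the third-party psutil package is installed, and the function name and 30-second interval are just placeholders.

import threading
import time

import psutil  # assumption: third-party dependency, not used in my actual script


def log_memory_periodically(interval_seconds=30):
    """Print this process's resident set size every interval_seconds (daemon thread)."""
    process = psutil.Process()

    def loop():
        while True:
            rss_mib = process.memory_info().rss / (1024 * 1024)
            print('\n[memory] RSS: {:.1f} MiB'.format(rss_mib))
            time.sleep(interval_seconds)

    threading.Thread(target=loop, daemon=True).start()

Calling log_memory_periodically() once before process_set() is enough to see whether the RSS keeps climbing or levels off.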

I'm running a script which uses the staticfg library to generate a tonne of control flow graphs (CFGs) from Python programs, approximately 150,000 of them. My code simply loops over every program's file path and generates a corresponding control flow graph.

From a frequently updated progress bar I can see that when the script begins running it easily generates around 1000 CFGs in a few seconds, but half an hour into running it can barely generate 100 CFGs within a minute.

In an attempt to speed things up I implemented multithreading using the map() function of a thread pool from Python's multiprocessing.dummy module, but this doesn't help enough.

Furthermore, the CPU utilization (for all cores) shoots up to around 80-90% at the beginning of the script but drops to around 30-40% after running for a few minutes.

I've tried running it on Windows 10 and Ubuntu 18.04 and both slow down to an almost unbearable speed.

Code for building the control flow graph

import os

from staticfg import CFGBuilder

def process_set():
    content = get_file_paths()
    iterate(build_cfg, ERROR_LOG_FILE, content)


def build_cfg(file_path):
    cfg = CFGBuilder().build_from_file(os.path.basename(file_path), os.path.join(DATA_PATH, file_path))
    cfg.build_visual(get_output_data_path(file_path), format='dot', calls=False, show=False)
    os.remove(get_output_data_path(file_path))  # Delete the other weird file created

Code for running the CFG building

from threading import Lock
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing

def iterate(task, error_file_path, content):
    progress_bar = ProgressBar(0, len(content), prefix='Progress:', suffix='Complete')
    progress_bar.print_progress_bar()

    error_file_lock = Lock()
    increment_work_lock = Lock()
    increment_errors_lock = Lock()

    def an_iteration(file):
        try:
            task(file)
        except Exception as e:
            with increment_errors_lock:
                progress_bar.increment_errors()
            with error_file_lock:
                handle_exception(error_file_path, file, 'Error in doing thing', e)
        finally:
            with increment_work_lock:
                progress_bar.increment_work()
                progress_bar.print_progress_bar()

    # Use the ThreadPool alias imported above and make sure the pool is shut down when done
    with ThreadPool(multiprocessing.cpu_count()) as pool:
        pool.map(an_iteration, content)

Code for error handling

import datetime

def handle_exception(error_log_file_path, file_path, message, stacktrace):
    with open(error_log_file_path, 'a+', encoding='utf8') as f:
        f.write('\r{},{},{},{}\n'.format(str(datetime.datetime.now()), message, file_path, stacktrace))

As far as I can tell, there is no object that keeps growing in size and no lookup that gets slower over time, so I'm a little lost as to why the script should be slowing down at all. Any help would be greatly appreciated.
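
To double-check that, allocation snapshots from early and late in the run could be compared with the standard-library tracemalloc module. A minimal sketch (where exactly the snapshots are taken and the top-10 cut-off are arbitrary):

import tracemalloc

tracemalloc.start()

baseline = tracemalloc.take_snapshot()  # e.g. after the first few thousand CFGs

# ... let the pool keep running for a while ...

later = tracemalloc.take_snapshot()
for stat in later.compare_to(baseline, 'lineno')[:10]:
    print(stat)  # the lines whose allocations have grown the most between snapshots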

I'm also pretty sure that it's not contention for the locks that is slowing the program down, as I was having this problem before I implemented multithreading, and contention should be pretty low anyway because the CFG building takes a lot more time than updating the progress bar. Furthermore, errors aren't that frequent, so writing to the error log doesn't happen often enough to cause much contention.
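
Just to illustrate that the locks aren't central to the design, the same bookkeeping could be done entirely in the main thread by consuming results from imap_unordered, using the same imports and helpers as above. This is only a sketch of an alternative (iterate_without_locks is a hypothetical name), not the code I'm actually running:

def iterate_without_locks(task, error_file_path, content):
    progress_bar = ProgressBar(0, len(content), prefix='Progress:', suffix='Complete')
    progress_bar.print_progress_bar()

    def an_iteration(file):
        # Workers only run the task and report back; no shared state is touched here.
        try:
            task(file)
            return file, None
        except Exception as e:
            return file, e

    with ThreadPool(multiprocessing.cpu_count()) as pool:
        # Results arrive as workers finish, so all progress and error updates
        # happen in the main thread and no locks are needed.
        for file, error in pool.imap_unordered(an_iteration, content):
            if error is not None:
                progress_bar.increment_errors()
                handle_exception(error_file_path, file, 'Error in doing thing', error)
            progress_bar.increment_work()
            progress_bar.print_progress_bar()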

Cheers.

Edit 2: Code for the progress bar, in case it affects the memory usage

class ProgressBar:
    def __init__(self, iteration, total, prefix='', suffix='', decimals=1, length=100, fill='█'):
        self.iteration = iteration
        self.total = total
        self.prefix = prefix
        self.suffix = suffix
        self.decimals = decimals
        self.length = length
        self.fill = fill
        self.errors = 0

    def increment_work(self):
        self.iteration += 1

    def increment_errors(self):
        self.errors += 1

    def print_progress_bar(self):
        percent = ("{0:." + str(self.decimals) + "f}").format(100 * (self.iteration / float(self.total)))
        filled_length = int(self.length * self.iteration // self.total)
        bar = self.fill * filled_length + '-' * (self.length - filled_length)
        print('%s |%s| %s%% (%s/%s) %s, %s %s' % (self.prefix, bar, percent, self.iteration, self.total, self.suffix, str(self.errors), 'errors'), end='\r')
        # Print New Line on Complete
        if self.iteration == self.total:
            print()
  • What is `task`? Did you try checking if your memory usage is constant? – fizzybear Jul 07 '19 at 04:36
  • @fizzybear now that you mention memory, I do see that it's been increasing steadily. Any idea what it could be? `task` is the function passed to `iterate()`, so in this case `task` is the `build_cfg()` function shown in the first code snippet. – Buster Darragh-Major Jul 07 '19 at 04:44
  • It's difficult to say without looking at the full code. I would first check if your system is paging or if the memory is increasing but staying below RAM limits. The issue could also be related to `multiprocessing.dummy`. Are you appending to a list or something inside every worker? – fizzybear Jul 07 '19 at 04:56
  • Is there a reason for `multiprocessing.dummy`? It seems like this is a simple parallel map job that would actually benefit from the normal `multiprocessing`. – fizzybear Jul 07 '19 at 04:59
  • I'm not too familiar with Python's multithreading but I'll have a look into just using `multiprocessing` when I can (a rough sketch of what that swap could look like is below these comments), although I'm unsure that the issue is related to the multithreading because I had this issue before I introduced any multithreading. I will look into the memory paging when I get home! – Buster Darragh-Major Jul 07 '19 at 05:05
  • You could look at this to see if there is a memory leak somehow: https://stackoverflow.com/questions/1435415/python-memory-leaks. Random thought: are your files ordered in some way? Maybe more complex files naturally take more compute. – fizzybear Jul 07 '19 at 05:11
  • Had a quick look at `GraphViz`, which is what `staticfg` uses for visualization. Seems like their default rendering engine is known for being slow on large inputs (https://stackoverflow.com/questions/10766100/graphviz-dot-very-long-duration-of-generation). If one of your files is like 100x larger than typical, it could make everything crawl. – fizzybear Jul 07 '19 at 05:30
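
As mentioned in the comments, one thing to try is swapping the thread pool for a real process pool. A rough sketch of that swap is below; it assumes build_cfg and the wrapper around it are module-level functions (so they can be pickled for the worker processes), and safe_build_cfg, run_with_processes and the chunksize of 32 are placeholders rather than anything from my actual code.

import multiprocessing


def safe_build_cfg(file_path):
    # Returns the exception instead of raising so one bad file doesn't abort the whole map.
    try:
        build_cfg(file_path)
        return None
    except Exception as e:
        return file_path, e


def run_with_processes(content):
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        results = pool.map(safe_build_cfg, content, chunksize=32)
    for file_path, e in (r for r in results if r is not None):
        handle_exception(ERROR_LOG_FILE, file_path, 'Error in doing thing', e)

On Windows this would have to be kicked off from under an `if __name__ == '__main__':` guard, and whether it helps at all depends on how much of the per-file time is actually CPU-bound Python.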
