
I have a Python function that is running very slowly, and I am looking for suggestions on how to optimize it. The function compares two Genotype objects and returns a dictionary mapping each sample to its genotype codes from both objects, concatenated as a string. The function's runtime may be affected by the size of the Genotype objects and the length of the sample_list. Here is the current implementation of the function:

def compare_gt(gt1: Genotype, gt2: Genotype, sample_list: list = None):
    # treat a genotype that fails isValid() as missing (None)
    if gt1 is not None and not gt1.isValid():
        gt1 = None
    if gt2 is not None and not gt2.isValid():
        gt2 = None
    if gt1 is None and gt2 is None:
        return None
    if gt1 is None:
        if sample_list is None:
            sample_list = gt2.get_sample_list()
        return {sample: "6" + gt2.get_sample_genotypes(sample_list)[sample] for sample in sample_list}
    if gt2 is None: 
        if sample_list is None:
            sample_list = gt1.get_sample_list()
        return {sample: gt1.get_sample_genotypes(sample_list)[sample] + "6" for sample in sample_list}
    # both genotypes are present: compare them sample by sample
    if sample_list is None:
        sample_list = gt1.get_sample_list()
    return {sample: gt1.get_sample_genotypes(sample_list)[sample] + gt2.get_sample_genotypes(sample_list)[sample] for sample in sample_list}

I have already tried a few optimization techniques, such as avoiding recomputing sample_list and storing the results of gt1 and gt2 in local variables to avoid repeated method calls. However, the function is still running very slowly.
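
The version with those changes looks roughly like this (a sketch rather than the exact code I benchmarked; compare_gt_cached is a hypothetical name, and the None/validity handling from above is unchanged and omitted here):

def compare_gt_cached(gt1: Genotype, gt2: Genotype, sample_list: list = None):
    # Same invalid/None handling as above, omitted for brevity.
    if sample_list is None:
        sample_list = gt1.get_sample_list()
    # Fetch each genotype dict once instead of once per sample.
    g1 = gt1.get_sample_genotypes(sample_list)
    g2 = gt2.get_sample_genotypes(sample_list)
    return {sample: g1[sample] + g2[sample] for sample in sample_list}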

I would appreciate any suggestions on how to optimize this function further. Thank you in advance for your help.

=======update========

As @J_H suggested, I have added some pseudocode to illustrate how the program works:

class gt_reader:
    def __init__(self, ...):
        # do something: open the gzip-compressed, position-sorted tab file
        ...
    def update(self) -> bool:
        # read the next genotype line; return False when the file is exhausted
        ...
    def get_sample_genotypes(self, sample_list) -> dict:
        # do something: build {sample: genotype} for the current line
        return sample_gt
    # other methods

gt1 = gt_reader(...)
gt2 = gt_reader(...)

while gt1.update():
    # some code
    while gt2.update():
        # some code
        compare_gt(gt1, gt2)
        # some code
        if x:
            break

In addition to the suggestions provided in the comments, I implemented some further optimizations. For example, I made sure that the output of get_sample_genotypes() is always ordered by its query (sample_list), so when gt1 and gt2 have the same samples I no longer have to index into the returned dictionaries sample by sample; this saves a noticeable amount of time over a large number of comparison operations.
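
Concretely, the idea is roughly the following (a sketch with a hypothetical name, assuming get_sample_genotypes() returns its dict in sample_list order for both readers):

def compare_gt_ordered(gt1, gt2, sample_list):
    # Both dicts are built in sample_list order, so their values line up
    # and can be paired directly instead of being indexed by sample name.
    g1 = gt1.get_sample_genotypes(sample_list)
    g2 = gt2.get_sample_genotypes(sample_list)
    return {sample: a + b
            for sample, (a, b) in zip(sample_list, zip(g1.values(), g2.values()))}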

  • First you need to find where it is taking the most time. Use any profiler or add print statements to find that. – Albin Paul Jun 19 '23 at 02:50
  • I have tried `cProfile`, and the output shows that `compare_gt()` uses about 1/3 of the total time, and the most is `` – zhang Jun 19 '23 at 02:52
  • Yeah, I have tried this. It seems to save 5~8 s of the 15 s of `compare_gt()` run time on my test data (about 0.1 million calls). `compare_gt()` will be called about 150 million times on my real data, so I want to reduce its execution time as much as possible – zhang Jun 19 '23 at 03:06
  • If there are no other code improvements, you can try to speed up the creation of the dictionary with multiprocessing. Maybe this is helpful? https://stackoverflow.com/questions/6832554/multiprocessing-how-do-i-share-a-dict-among-multiple-processes. – Albin Paul Jun 19 '23 at 03:24
  • Thanks for your kind help. In fact, I have wanted to use multiprocessing to speed up this program. However, the trouble is that my program's input is two very large gzip-compressed tab files, each containing a column of positions (sorted numerical values). Essentially, this program compares some information one position at a time. But because the positions in the two files are not identical, and the program must handle inconsistent positions (a position that exists in only one file), it seems difficult to reliably split the task. In simple terms – zhang Jun 19 '23 at 03:42
  • so I would have to split at locations where the two files are consistent, but the basic principle of the program is to process the data line by line. – zhang Jun 19 '23 at 03:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254140/discussion-between-zhang-and-albin-paul). – zhang Jun 19 '23 at 03:42
  • You showed us a small part of your program which does 1000x less looping than the outer part. Show us that part. And revealing genotype details would help, as there's probably a better way to iterate over samples. Also, for scalable algorithm design, often an [external](https://en.wikipedia.org/wiki/External_sorting) merge sort will win. It's what RDBMS optimizers do for a JOIN. You likely compute many things which you then discard, and for many use cases merging input streams can help to avoid that. To make the _current_ algorithm go fast, switch to Rust. – J_H Jun 19 '23 at 14:55
  • Hi @J_H, thanks for your advice, I have updated it – zhang Jun 28 '23 at 08:56
  • Given the nested loops, we see that `gt1` is constant for long periods, while `gt2` is rapidly changing. This suggests [hoisting](https://en.wikipedia.org/wiki/Loop-invariant_code_motion) many `gt1` checks out of the `compare_gt` function, perhaps through use of [partial](https://docs.python.org/3/library/functools.html#functools.partial) to specialize for a given `gt1`. There is overhead to probing an [LRU cache](https://docs.python.org/3/library/functools.html#functools.lru_cache), but for "expensive" calculations it's worth it. Still, merge sort will offer the biggest algorithmic win. – J_H Jun 28 '23 at 13:05
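
A rough sketch of the hoisting idea from the last comment (hypothetical names; it assumes the same Genotype interface as above and only covers the case where sample_list is already known):

from functools import partial

def compare_with_fixed_gt1(g1, sample_list, gt2):
    # g1 is the pre-fetched {sample: genotype} dict for gt1 (or None if gt1
    # is missing/invalid); only the gt2 work remains per call.
    if gt2 is not None and not gt2.isValid():
        gt2 = None
    if g1 is None and gt2 is None:
        return None
    if gt2 is None:
        return {s: g1[s] + "6" for s in sample_list}
    g2 = gt2.get_sample_genotypes(sample_list)
    if g1 is None:
        return {s: "6" + g2[s] for s in sample_list}
    return {s: g1[s] + g2[s] for s in sample_list}

# In the outer loop: specialize once per gt1, reuse for every gt2.
# g1 = gt1.get_sample_genotypes(sample_list) if gt1 is not None and gt1.isValid() else None
# compare = partial(compare_with_fixed_gt1, g1, sample_list)
# while gt2.update():
#     result = compare(gt2)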

0 Answers