I have a Python function that is running very slowly, and I am looking for suggestions on how to optimize it. The function compares two Genotype objects and returns a dictionary of sample genotypes. The function's runtime may be affected by the size of the Genotype objects and the length of the sample_list. Here is the current implementation of the function:
    def compare_gt(gt1: Genotype, gt2: Genotype, sample_list: list = None):
        # treat an invalid genotype as missing (None)
        if gt1 is not None and not gt1.isValid():
            gt1 = None
        if gt2 is not None and not gt2.isValid():
            gt2 = None
        if gt1 is None and gt2 is None:
            return None
        # only gt2 is present: mark the gt1 side with "6"
        if gt1 is None:
            if sample_list is None:
                sample_list = gt2.get_sample_list()
            return {sample: "6" + gt2.get_sample_genotypes(sample_list)[sample] for sample in sample_list}
        # only gt1 is present: mark the gt2 side with "6"
        if gt2 is None:
            if sample_list is None:
                sample_list = gt1.get_sample_list()
            return {sample: gt1.get_sample_genotypes(sample_list)[sample] + "6" for sample in sample_list}
        # compare the two genotypes
        if sample_list is None:
            sample_list = gt1.get_sample_list()
        return {sample: gt1.get_sample_genotypes(sample_list)[sample] + gt2.get_sample_genotypes(sample_list)[sample] for sample in sample_list}
I have already tried a few optimization techniques, such as avoiding duplicate calculations of sample_list and storing the results of gt1 and gt2 in variables to avoid repeated method calls. However, the function is still running very slowly.
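For reference, the hoisting I mean looks roughly like the sketch below. It is only an illustration under assumptions: `StubGenotype` is a hypothetical stand-in for my real Genotype class, `compare_gt_hoisted` is a hypothetical name, and I assume get_sample_genotypes() returns a dict keyed by sample name with string values. The point is that get_sample_genotypes() is called once per object rather than once per sample.

```python
from typing import Dict, List, Optional


class StubGenotype:
    """Hypothetical stand-in for Genotype, used only for this illustration."""

    def __init__(self, calls: Dict[str, str], valid: bool = True):
        self._calls = calls
        self._valid = valid
        self.lookups = 0  # counts how often get_sample_genotypes() is called

    def isValid(self) -> bool:
        return self._valid

    def get_sample_list(self) -> List[str]:
        return list(self._calls)

    def get_sample_genotypes(self, sample_list: List[str]) -> Dict[str, str]:
        self.lookups += 1
        return {s: self._calls[s] for s in sample_list}


def compare_gt_hoisted(gt1, gt2, sample_list: Optional[List[str]] = None):
    # Treat an invalid genotype as missing, as in the original function.
    if gt1 is not None and not gt1.isValid():
        gt1 = None
    if gt2 is not None and not gt2.isValid():
        gt2 = None
    if gt1 is None and gt2 is None:
        return None
    if sample_list is None:
        sample_list = (gt1 if gt1 is not None else gt2).get_sample_list()
    # One call per object instead of one call per sample.
    g1 = gt1.get_sample_genotypes(sample_list) if gt1 is not None else None
    g2 = gt2.get_sample_genotypes(sample_list) if gt2 is not None else None
    return {
        s: (g1[s] if g1 is not None else "6") + (g2[s] if g2 is not None else "6")
        for s in sample_list
    }
```

With two stub objects, `compare_gt_hoisted(a, b)` leaves `a.lookups` and `b.lookups` at 1 each, whereas the comprehension in the original version would call the method once for every sample.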
I would appreciate any suggestions on how to optimize this function further. Thank you in advance for your help.
=======update========
As @J_H suggested, I am adding some pseudocode to illustrate how the program works:
    class gt_reader():
        def __init__(self, ...):
            # do something
        def update(self):
            # read the next genotype line
        def get_sample_genotypes(self, sample_list) -> dict:
            # do something
            return sample_gt
        # other methods

    gt1 = gt_reader(...)
    gt2 = gt_reader(...)
    while gt1.update():
        # some code
        while gt2.update():
            # some code
            compare_gt(gt1, gt2)
            # some code
            if x:
                break
In addition to the suggestions provided by the participants, I also implemented some further optimizations. For example, I ensured that the output of get_sample_genotypes() is always ordered to match the queried sample_list. When gt1 and gt2 have the same samples, I therefore no longer need to look up each returned value by sample name, which improves performance over a large number of comparison operations.
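A minimal sketch of that last idea, under assumptions: `pair_in_order` is a hypothetical helper name, and both arguments are dicts as returned by get_sample_genotypes() whose values come out in the same order as sample_list (Python dicts preserve insertion order since 3.7). If either dict were built in a different order, the pairing would silently be wrong, so this only applies when that ordering guarantee holds.

```python
from typing import Dict, List


def pair_in_order(sample_list: List[str],
                  gts1: Dict[str, str],
                  gts2: Dict[str, str]) -> Dict[str, str]:
    """Combine genotypes positionally instead of looking each one up by key.

    Assumes gts1 and gts2 were both built in sample_list order.
    """
    return {
        sample: a + b
        for sample, a, b in zip(sample_list, gts1.values(), gts2.values())
    }


samples = ["s1", "s2", "s3"]
g1 = {"s1": "0", "s2": "1", "s3": "2"}
g2 = {"s1": "1", "s2": "1", "s3": "0"}
combined = pair_in_order(samples, g1, g2)
```

Iterating the values with zip avoids two hash lookups per sample, which adds up when the comparison runs for every pair of genotype lines.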