I am using genomic tree sequences in a project (see the tree sequence toolkit, tskit, and Relate inference). My tree sequence contains about 1 million trees, each of which contains 884 samples. To summarise the aim: I iterate through the trees consecutively and examine every sample in a function, outputting an integer value in the range 0-8 for every sample in every tree. Consecutive trees change very little from one to the next, so I only examine the samples affected by changes; for samples where nothing changed, the results from the previous tree are copied over. The outer loop over the trees looks like this:
```python
# Preallocate the full result matrix up front
predict = np.zeros((num_trees, num_samples), dtype=int)
progress_bar = tqdm.tqdm(total=num_trees)
results_prev = np.zeros((num_samples, 9 * 4))
index = 0
for diff in tree_sequence.edge_diffs():
    # diff is (interval, edges_out, edges_in); diff[2] is the edges
    # inserted at this tree transition
    tree = tree_sequence.at_index(index)  # was ts.at_index(index) -- same object
    results = sam_sites(diff[2], tree, results_prev)
    predict[index, :] = results
    results_prev = results
    del results
    progress_bar.update()
    index += 1
progress_bar.close()
```
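For reference, the incremental pattern this loop relies on (carry over the previous tree's results, recompute only the changed samples) can be sketched library-free with NumPy. The per-tree `changed` sets and the `recompute` function here are made up purely for illustration:

```python
import numpy as np

num_trees, num_samples = 4, 6
predict = np.zeros((num_trees, num_samples), dtype=int)

# Hypothetical per-tree sets of samples whose value changed
changed_per_tree = [{0, 1, 2, 3, 4, 5}, {2}, {4, 5}, set()]

def recompute(index, sample):
    # Stand-in for the real per-sample computation (returns 0-8)
    return (index + sample) % 9

prev = np.zeros(num_samples, dtype=int)
for index, changed in enumerate(changed_per_tree):
    row = prev.copy()                 # carry over unchanged samples
    for sam in changed:
        row[sam] = recompute(index, sam)  # recompute only changed samples
    predict[index, :] = row
    prev = row
```

Each row costs time proportional to the number of changed samples, not to `num_samples`, which is the whole point of the incremental approach.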
The sam_sites function looks like this:
```python
def sam_sites(diff, tree, results_prev):
    # Nodes whose edges changed at this tree transition
    nodes = {diff[i].child for i in range(len(diff))}
    # All sample leaves beneath any changed node
    samples_changed = {leaf for node in nodes for leaf in tree.leaves(node)}
    # NB: this rebinds a name to the same array, it does not copy
    results_new = results_prev
    results_new[np.array(list(samples_changed)), :] = 0
    for sam in samples_changed:  # was `sams_new`, which is undefined
        # Extract an integer value for each sample in samples_changed
        # and replace the 0 in the results_new array
        ...
    return results_new
```
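One thing worth knowing about the line `results_new = results_prev`: in Python this binds a second name to the same NumPy array rather than making a copy, so zeroing rows of `results_new` also mutates the caller's `results_prev`. A quick demonstration of the difference (pure NumPy; the names just mirror the function above):

```python
import numpy as np

results_prev = np.ones((4, 3))
results_new = results_prev          # same array, two names (what sam_sites does)
results_new[[0, 2], :] = 0          # zero two rows
aliased = np.shares_memory(results_prev, results_new)  # True
mutated_sum = results_prev.sum()    # 6.0 -- results_prev was zeroed too

fresh_prev = np.ones((4, 3))
fresh_new = fresh_prev.copy()       # independent copy
fresh_new[[0, 2], :] = 0
intact_sum = fresh_prev.sum()       # 12.0 -- original untouched
```

Whether aliasing or copying is wanted depends on the intent; the aliasing version at least means no new array is allocated per call.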
When I first run the code it works pretty quickly, processing about 40 trees per second, but as the iterations continue the rate falls off. I've had a script running for 24 hours and the processing speed has fallen to 2 trees per second.
From other searches, I thought this could be a memory problem. Perhaps writing to the predict NumPy array on each iteration increases the memory used and slows down the script?
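For what it's worth, row assignment into a preallocated NumPy array writes into a buffer that was allocated once up front, so `predict` itself should not grow over the run. That part of the hypothesis can be checked directly (pure NumPy, smaller dimensions for speed):

```python
import numpy as np

num_trees, num_samples = 1000, 884
predict = np.zeros((num_trees, num_samples), dtype=int)

bytes_before = predict.nbytes
ptr_before = predict.__array_interface__['data'][0]

# Fill rows one by one, as in the main loop
for index in range(num_trees):
    predict[index, :] = np.arange(num_samples)

bytes_after = predict.nbytes
ptr_after = predict.__array_interface__['data'][0]
# Neither the buffer address nor its size changes during filling
```

If the buffer address and size stay constant, any memory growth (or slowdown) would have to come from somewhere other than the `predict[index, :] = results` assignments.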