3

I am using genomic tree sequences in a project (See tree sequence toolkit and Relate inference). My tree sequence contains about 1 million trees and each of which contains 884 samples. To summarise the aim; I iterate through the trees consecutively and examine every sample in a function, outputting an integer value in the range 0-8 for every sample in every tree. Trees change very little from one to the next consecutively so I actually only look at the samples for which changes have occurred and results from the previous tree are copied for those samples where no changes occurred. The outer loop over the trees looks like this:

    predict=np.zeros((num_trees,num_samples), dtype=int)
    progress_bar = tqdm.tqdm(total=num_trees)
    results_prev=np.zeros((num_samples,(9*4)))
    index=0
    for diff in tree_sequence.edge_diffs():
            tree=ts.at_index(index)
            results = sam_sites(diff[2],tree, results_prev)
            predict[index,:]=results
            results_prev=results
            del results
            progress_bar.update()
            index+=1
    progress_bar.close()

While the sam_sites function looks like this:

    def sam_sites(diff, tree, results_prev):
            nodes={diff[i].child for i in range(len(diff))}
            samples_changed={leaf for node in nodes for leaf in tree.leaves(node)}

            results_new=results_prev
            results_new[np.array(list(samples_changed)),:]=0

            for sam in sams_new:
                    "Extract a integer value for each of samples_changed and replace 0 in results_new array"
            return results_new

When I fist run the code it works pretty quickly, processing about 40 trees per second. But as the iterations continue the number of trees processed per second falls off. I've has a script running for 24 hours and the processing speed has fallen to 2 trees per second.

From other searches, I thought this could be a memory problem? Perhaps adding to the predict numpy array with each iteration increases the memory used and slows down the script?

alpal
  • 31
  • 2

1 Answers1

0

Solved it! For some reason the line...

    tree=ts.at_index(index)

to extract the right tree with each iteration was the cause of the slow down. I replaced it with the .next() function instead. The code is now:

    predict=np.zeros((num_trees,num_samples), dtype=int)
    progress_bar = tqdm.tqdm(total=num_trees)
    results_prev=np.zeros((num_samples,(9*4)))
    tree = tskit.Tree(ts)
    index=0
    for diff in tree_sequence.edge_diffs():
            tree.next()
            results = sam_sites(diff[2],tree, results_prev)
            predict[index,:]=results
            results_prev=results
            del results
            index+=1
            progress_bar.update()
    progress_bar.close()

This works consistently at about 30 trees per second.

alpal
  • 31
  • 2