

I am implementing an algorithm in Python using Biopython. I have several alignments (sets of sequences of equal length) stored in FASTA files. Each alignment contains between 500 and 30,000 sequences, and each sequence is about 17,000 elements long. Each sequence is stored as a Bio.SeqRecord.SeqRecord object (see the SeqRecord API documentation for more information), which holds not only the sequence itself but also some information about it. I read the alignment from disk using Bio.AlignIO.read() (see the AlignIO module's API documentation for more information), which returns a MultipleSeqAlignment object:

from Bio import AlignIO

seqs = AlignIO.read(seqs_filename, 'fasta')
len_seqs = seqs.get_alignment_length()
stats = {'-': [0.0] * len_seqs, 'A': [0.0] * len_seqs,
         'G': [0.0] * len_seqs, 'C': [0.0] * len_seqs,
         'T': [0.0] * len_seqs}


(I included a sketch here for clarity purposes, but the image is not available.)


Because I want to parallelize the analysis of the alignment, I assign a fragment of it to each available CPU using the threading module (more details about why I made this decision later):

from math import ceil
from multiprocessing import cpu_count

num_cpus = cpu_count()
num_columns = int(ceil(len_seqs / float(num_cpus)))
start_column = 0
threads = []
for cpu in range(0, num_cpus):
    section = (start_column, start_column + num_columns)
    threads.append(CI_Thread(seqs_type, seqs, section, stats))
    threads[cpu].start()

    start_column += num_columns
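
CI_Thread is my own threading.Thread subclass (its real implementation is not included here). To give an idea of its general shape, here is a simplified, self-contained sketch that uses toy string data and a trivial per-column count in place of the actual analysis:

import threading


class CI_Thread(threading.Thread):
    """Simplified sketch (not the real implementation): analyse only the
    columns in `section` and write the results into the positions of the
    shared `stats` dict that belong to that section, so no two threads
    ever touch the same indices."""

    def __init__(self, seqs_type, seqs, section, stats):
        super(CI_Thread, self).__init__()
        self.seqs_type = seqs_type
        self.seqs = seqs
        self.start_column, self.end_column = section
        self.stats = stats

    def run(self):
        for seq in self.seqs:
            columns = seq[self.start_column:self.end_column].upper()
            for offset, elem in enumerate(columns):
                if elem in self.stats:
                    # Toy per-column "analysis": just count occurrences.
                    self.stats[elem][self.start_column + offset] += 1


if __name__ == '__main__':
    toy_seqs = ['ACGT-ACG', 'A-GTTACG']          # toy data, not SeqRecords
    length = len(toy_seqs[0])
    stats = {c: [0] * length for c in '-ACGT'}
    threads = [CI_Thread('dna', toy_seqs, (0, 4), stats),
               CI_Thread('dna', toy_seqs, (4, 8), stats)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(stats['A'])                            # counts of 'A' per column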

The stats variable is a shared one in which I store some information about the analysis (as you can see in the first code snippet). Because each CPU is going to modify different positions of this shared variable, I think there isn't any need for access control or any synchronization primitive. The main reason I am using threads instead of processes is that I want to avoid copying the whole MultipleSeqAlignment object for each CPU. I've done some research and found some Stack Overflow posts about this.

I've also read some information about the "dreaded" Python Global Interpreter Lock (GIL) (I found great info both on Stack Overflow and on Programmers Stack Exchange), but I'm still not 100% sure whether my algorithm is affected by it. As far as I know, I'm loading sequences into memory one at a time, thus doing I/O on every iteration. This is why I think it's a good idea to use threads, as was stated in the Stack Overflow post I've already mentioned. The basic structure of the analysis (which each thread executes) looks something like this:

for seq in seqs:
    num_column = start_column
    for column in seq.seq[start_column:end_column].upper():
        # [...]
            try:
                stats[elem][num_column] = get_some_info(column)
            except TypeError:
                raise TypeError(('"stats" argument should be a '
                                 'dict of lists of int'))

I have done some performance tests using the timeit module and the time command with the arguments -f "%e %M" to check not only the elapsed real time (in seconds) but also the maximum resident set size of the process during its lifetime (in kilobytes). It seems that the execution time using threads is roughly the execution time of the sequential implementation divided by the number of threads. For the maximum resident set size I'm still unable to find a pattern.
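
To make the elapsed-time measurement concrete, the idea is along these lines (run_analysis below is just a placeholder standing in for the function that spawns and joins the threads):

import timeit


def run_analysis():
    """Placeholder standing in for the real function that spawns the
    threads and waits for them to finish."""
    return sum(range(1000000))


# Wall-clock time (in seconds) of one full analysis pass.
elapsed = timeit.timeit(run_analysis, number=1)
print(elapsed)

# Maximum resident set size is measured separately with GNU time, e.g.:
#   /usr/bin/time -f "%e %M" python analysis_script.py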


If you have any other suggestions about performance or clarity I would much appreciate them.


Thanks in advance.

  • As far as your reputation was concerned, the best thing I could do was to upvote your question; you should have 10+ reputation now, I guess. But sorry, I can't help you with the question. – ZdaR Dec 29 '14 at 13:33
  • Thanks @Anmol_uppal! I've just edited the question :-) – Francisco Merino Dec 29 '14 at 14:29

1 Answer


Threading won't help you when you want to run a presumably CPU-intensive algorithm over some data. Try looking into the multiprocessing module, which I had great success with when working on a project that did special OCR on a 100 MB scanned image.

Consider this for solving the shared memory issue:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result = pool.apply_async(f, [10])    # evaluate "f(10)" asynchronously
    print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"

Take a look at pool.imap() and pool.apply_async().
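
To connect this to your alignment problem, here is a rough, self-contained sketch with toy data and a trivial per-column count in place of your real analysis: each worker returns a partial stats dict for its own column range and the parent merges them afterwards, so nothing has to be shared or locked between processes.

from math import ceil
from multiprocessing import Pool, cpu_count

# Toy stand-in for the alignment: a list of equal-length strings.
SEQS = ['ACGT-ACGT', 'A-GTTACGT', 'ACGAAAC-T']


def analyze_section(section):
    """Count residues per column for the columns in [start, end); returns a
    partial stats dict so nothing has to be shared between the processes."""
    start, end = section
    length = len(SEQS[0])
    partial = {c: [0] * length for c in '-ACGT'}
    for seq in SEQS:
        for num_column in range(start, min(end, length)):
            elem = seq[num_column].upper()
            if elem in partial:
                partial[elem][num_column] += 1
    return partial


if __name__ == '__main__':
    length = len(SEQS[0])
    num_cpus = cpu_count()
    num_columns = int(ceil(length / float(num_cpus)))
    sections = [(i, i + num_columns) for i in range(0, length, num_columns)]
    pool = Pool(processes=num_cpus)
    partials = pool.map(analyze_section, sections)
    # Merge: each partial dict only has non-zero counts in its own columns.
    stats = {c: [sum(p[c][i] for p in partials) for i in range(length)]
             for c in '-ACGT'}
    print(stats['A'])

With the real alignment you would pass only the (start, end) tuples to the workers. On Unix the Pool workers are forked after the parent has loaded the alignment, so the MultipleSeqAlignment object is inherited rather than pickled to each of them (although copy-on-write means memory usage can still grow as pages get touched).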

Hope that helped.

whiterock
  • Please explain why threading won't help, since one of the primary uses of threading is to allow processing with more than one core at a time by one program. – Mike DeSimone Dec 29 '14 at 15:09
  • It is, but in Python threading is not really concurrent when it comes to splitting up CPU work; it is good for I/O, networking and similar things. Libraries like `multiprocessing` and `subprocess` exist for a reason. – whiterock Dec 29 '14 at 15:15
  • Well, that's true for CPython anyway. And it's true for that because of the GIL, which the questioner is trying to get around. So if getting around the GIL is impossible in CPython, it would be nice if there was an explanation why. BTW, the `multiprocessing` and `subprocess` modules exist for other reasons, too, such as streamlining things that were previously handled by a set of OS-dependent functions in the `os` module. – Mike DeSimone Dec 29 '14 at 15:32
  • I think it's just too complicated and dangerous to do it in a different manner. Hopefully someone upvotes my answer :/ – whiterock Dec 29 '14 at 16:12
  • @Mike: It is *not possible* to "get around the GIL" in (pure) CPython. You could write an extension module in C or Cython (or something similar), or do all the real work inside NumPy, or something like that, but if you want to write simple pure Python, it just isn't going to happen. The GIL must be held while executing bytecode, adjusting reference counts, etc., and there is no reasonable way to circumvent it from within Python. – Kevin Dec 29 '14 at 16:39
  • @whiterock As far as I know, using the multiprocessing module would require copying the whole MultipleSeqAlignment object for each CPU, wouldn't it? Furthermore, if I've understood correctly, I'm doing I/O on each iteration (because I'm not loading the whole MultipleSeqAlignment object into memory when I call the Bio.AlignIO.read() method at the beginning), thus I'm not sure about this algorithm being so CPU-intensive. – Francisco Merino Dec 29 '14 at 18:06
  • @Francisco: Only you know your algorithm. The rule is this: only I/O benefits from threading. If your threads do some I/O and some computation, you will only benefit from parallelism when at least one thread is actually blocked waiting for the OS to perform a `file.read()`/`.write()` call (or related calls). The rest of the time, you'll actually see a slight *worsening* from the overhead of having multiple threads at all. Test it with `timeit` and `cProfile` and make your decision. – Kevin Dec 29 '14 at 20:11
  • @Kevin Thanks Kevin! I didn't know about cProfile. As I said before, I've already done some tests and apparently there is an improvement in the execution time. I wasn't sure about these results because of the GIL. The only reason I've found is that each thread must do some I/O in every iteration in order to read a new sequence, and thus the algorithm avoids the Global Interpreter Lock. – Francisco Merino Dec 30 '14 at 18:28