6

I'm pretty new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq-file. A fastq-file stores information in blocks of four rows each. The first row in each block starts with the character "@". The fastq file I use as my input file is 36 GB, containing about 14,000,000 lines.

I tried to rewrite an already existing script that used way too much memory, and I managed to reduce the memory usage a lot. But the script takes forever to run, and I don't see why.

import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type = str, help = "The name of the fastq input file.", default = sys.stdin)
parser.add_argument("outputfile", type = str, help = "Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()


def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)

The grep step takes practically no time at all, but after over 500 minutes the script still hasn't started writing to the output file. So I suppose it is one of the steps between grep and the last for-loop that takes so long, but I don't understand which step exactly, or what I can do to speed it up.

Leandro Papasidero
Sandra
  • You should consider [profiling](https://docs.python.org/2/library/profile.html) your program to see which steps hang. Also, have you tried running your code on a smaller file to see if it runs to completion? One other step I would consider later when optimizing is using threading and multiprocessing to divide up the work. – Jerome Anthony Dec 04 '14 at 22:10
  • Don't call `int` all the time inside the loop. Also, use the `with` statement. – Veedrac Dec 05 '14 at 13:08

4 Answers

2

Depending on the size of `linesamples`, the `if i in linesamples` check will take a long time, since you are searching through a list on each iteration through `infile`. You could convert it into a `set` to improve the lookup time. Also, `enumerate` is not very efficient - I have replaced it with a `line_num` counter that we increment on each iteration.

def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # build the set of line numbers that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(int(4*i+0))
        linesamples.add(int(4*i+1))
        linesamples.add(int(4*i+2))
        linesamples.add(int(4*i+3))
    # fetch lines from input file and write them to output file.
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    outputfile.close()
vikramls
  • I think this answer correctly identifies the bottleneck `if i in linesamples`. Explicitly closing the file handles would probably be a good idea, though :) – cel Dec 04 '14 at 22:29
  • 3
    `enumerate` is not efficient? Do you have any benchmarks to prove it? – Matthias Dec 05 '14 at 07:32
  • I agree with @Matthias; my timings show `enumerate` to be faster on both 2.7 and 3.4 on CPython. In any case, the overhead is negligible compared to the real overheads. – Veedrac Dec 05 '14 at 12:52
  • -1 for spreading false information. `enumerate` is **faster**. What you did is just replace some optimized C code handled by `enumerate` with several python bytecodes. Moreover, it should be preferred even for its intended semantics. A generic counter can do many things, but the return value of `enumerate` has a specific meaning. – Bakuriu Mar 21 '15 at 11:46
1

You said that grep finishes running quite quickly, so in that case, instead of just using grep to count the occurrences of @, have grep output the byte offset of each @ it sees (using grep's `-b` option). Then use `random.sample` to pick whichever blocks you want. Once you've chosen the byte offsets, use `infile.seek` to go to each one and write out the 4 lines from there.
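A minimal sketch of that approach, assuming Python 2 as in the question and reusing `args.infile`, `args.outputfile` and `args.n` from the original script (anchoring the pattern as `^@` is my own assumption that record headers start at the beginning of a line):

import random
import subprocess

def sample_by_offset():
    # grep -b prefixes each matching line with its byte offset, e.g. "1234:@SEQ_ID"
    out = subprocess.check_output(["grep", "-b", "^@", args.infile])
    offsets = [int(line.split(":", 1)[0]) for line in out.splitlines() if line]
    # pick n record start offsets at random:
    chosen = random.sample(offsets, int(args.n))
    with open(args.infile, 'r') as infile, open(args.outputfile, 'w') as outputfile:
        for offset in sorted(chosen):
            infile.seek(offset)         # jump straight to the start of the record
            for _ in range(4):          # copy the 4 lines of this fastq record
                outputfile.write(infile.readline())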

randomusername
0

Try to parallelize your code. What I mean is this: you have 14,000,000 lines of input.

  1. Run your grep first to filter your lines and write the result to filteredInput.txt.
  2. Split filteredInput.txt into files of 10,000-100,000 lines each, like filteredInput001.txt, filteredInput002.txt, and so on.
  3. Run your code on these split files, writing the output of each to separate files such as output001.txt, output002.txt (see the sketch below).
  4. Merge your results as the final step.

Since your code is not working at all, you may also run it on these filtered inputs: your code can check which filteredInput files already exist, work out which step it was on, and resume from that step.

You can also run multiple Python processes this way (after step 1) from your shell, or use Python threads.
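A rough sketch of steps 3 and 4 using multiprocessing; the chunk file names follow step 2, and sample_chunk is a hypothetical helper that would hold the sampling logic from the question:

import glob
from multiprocessing import Pool

def sample_chunk(chunk_name):
    """Hypothetical helper: run the question's sampling logic on one chunk file."""
    out_name = chunk_name.replace("filteredInput", "output")
    with open(chunk_name) as infile, open(out_name, 'w') as outfile:
        pass  # ... the per-chunk sampling logic would go here ...
    return out_name

if __name__ == "__main__":
    pool = Pool(processes=4)    # the number of worker processes is arbitrary here
    outputs = pool.map(sample_chunk, sorted(glob.glob("filteredInput*.txt")))
    pool.close()
    pool.join()
    # step 4: merge the per-chunk outputs into one final file
    with open("output.txt", 'w') as merged:
        for name in outputs:
            with open(name) as part:
                merged.write(part.read())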

Atilla Ozgur
  • 1
    It's probably not a good idea to suggest parallelizing before optimizing the algorithm. With the right algorithm, IO will be the bottleneck, not the CPU. – cel Dec 04 '14 at 22:23
  • @cel His code is not even working right now, but splitting the problem and parallelizing is not a bad idea. – Atilla Ozgur Dec 04 '14 at 22:27
0

You can use the Reservoir Sampling algorithm. With this algorithm you read through the data only once (there is no need to count the lines of the file in advance), so you can pipe data through your script. There's example Python code on the Wikipedia page.
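For illustration, a minimal one-pass sketch of reservoir sampling (Algorithm R) applied to 4-line fastq records; the function name and the choice to read from standard input are my own, not taken from the Wikipedia code:

import random
import sys

def reservoir_sample_fastq(stream, n):
    """Keep a uniform random sample of n fastq records (4 lines each) in one pass."""
    reservoir = []
    count = 0
    while True:
        record = [stream.readline() for _ in range(4)]
        if not record[0]:                   # end of input
            break
        if count < n:
            reservoir.append(record)        # fill the reservoir first
        else:
            j = random.randint(0, count)    # inclusive upper bound
            if j < n:
                reservoir[j] = record       # replace with decreasing probability
        count += 1
    return reservoir

# usage, e.g. at the end of a pipe: zcat reads.fastq.gz | python sample.py > subset.fastq
# for record in reservoir_sample_fastq(sys.stdin, 10):
#     sys.stdout.write("".join(record))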

There's also a C implementation for fastq sampling in Heng Li's seqtk.

A.P.