
I have a datafile with the IDs of 1,786,916 records, and I would like to retrieve the corresponding records from another file that contains about 4.8 million records (in this case DNA sequences, but basically just plain text). I wrote a Python script to do this, but it is taking ages to run (day 3 and it's only 12% done). Since I'm a relative newbie to Python, I was wondering if anyone had suggestions for making this faster.

Here is an example of the datafile with the IDs (the ID in the example is ANICH889-10):

ANICH889-10 k__Animalia; p__Arthropoda; c__Insecta; o__Lepidoptera; f__Psychidae; g__Ardiosteres; s__Ardiosteres sp. ANIC9
ARONW984-15 k__Animalia; p__Arthropoda; c__Arachnida; o__Araneae; f__Clubionidae; g__Clubiona; s__Clubiona abboti

Here is an example of the second file that contains the records:

>ASHYE2081-10|Creagrura nigripesDHJ01|COI-5P|HM420985
ATTTTATACTTTTTATTAGGAATATGATCAGGAATAATTGGTCTTTCAATAAGAATCATTATCCGTATTGAATTAAGAAATCCAGGATCTATTATTAATAATGACCAAATTTATAATTCATTAATTACTATACACGCACTATTAATAATTTTTTTTTTAGTTATACCTGTAATAATTGGAGGATTTGGAAATTGATTAATTCCTATTATAATTGGAGCCCCAGATATAGCATTTCCACGAATAAACAATCTTAGATTTTGATTATTAATCCCATCAATTTTCATATTAATATTAAGATCAATTACTAATCAAGGTGTAGGAACAGGATGAACAATATATCCCCCATTATCATTAAATATAAATCAAGAAGGAATATCAATAGATATATCAATTTTTTCTTTACATTTAGCAGGAATATCCTCAATTTTAGGATCAATTAATTTCATTTCAACTATTTTAAATATAAAATTTATTAATTCTAATTATGATCAATTAACTTTATTTTCATGATCAATTCTAATTACTACTATTTTATTATTACTAGCAGTCCCTGTATTAGCAGGAGCAATTACTATAATTTTAACTGATCGAAATTTAAATACTTCTTTTTTTGATCCTAGAGGAGGAGGAGATCCAATTT-----------------
>BCISA145-10|Hemiptera|COI-5P
AACTCTATACTTTTTACTAGGATCCTGGGCAGGAATAGTAGGAACATCATTAAGATGAATAATCCGAATTGAACTAGGACAACCTGGATCTTTTATTGGAGATGACCAAACTTATAATGTAATTGTAACTGCCCACGCATTTGTAATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAATTCCCTTAATAATTGGAGCACCCGATATAGCATTCCCACGAATGAATAACATAAGATTTTGATTGCTACCACCGTCCCTAACACTTCTAATCATAAGTAGAATTACAGAAAGAGGAGCAGGAACAGGATGAACAGTATACCCTCCATTATCCAGAAACATCGCCCATAGAGGAGCATCTGTAGATTTAGCAATCTTTTCCCTACATCTAGCAGGAGTATCATCAATTTTAGGAGCAGTTAACTTCATTTCAACAATTATTAATATACGACCAGCAGGAATAACCCCAGAACGAATCCCATTATTTGTATGATCTGTAGGAATTACAGCACTACTACTCCTACTTTCATTACCCGTACTAGCAGGAGCCATTACCATACTCTTAACTGACCGAAACTTCAATACTTCTTTTTTTGACCCTGCTGGAGGAGGAGATCCCATCCTATATCAACATCTATTC

However, in the second file the DNA sequences are broken over several lines rather than being on a single line, and they are not always the same length.

EDIT

Here is my desired output:

>ANICH889-10
GGGATTTGGTAATTGATTAGTTCCTTTAATA---TTGGGGGCCCCTGACATAGCTTTTCCTCGTATAAATAATATAAGATTTTGATTATTACCTCCCTCTCTTACATTATTAATTTCAAGAAGAATTGTAGAAAATGGAGCTGGGACTGGATGAACTGTTTACCCTCCTTTATCTTCTAATATCGCCCATAGAGGAAGCTCTGTAGATTTA---GCAATTTTCTCTTTACATTTAGCAGGAATTTCTTCTATTTTAGGAGCAATTAATTTTATTACAACAATTATTAATATACGTTTAAATAATTTATCTTTCGATCAAATACCTTTATTTGTTTGAGCAGTAGGAATTACAGCATTTTTACTATTACTTTCTTTACCTGTATTAGCTGGA---GCTATTACTATATTATTAACT---------------------------------------------------------------------------
>ARONW984-15
TGGTAACTGATTAGTTCCATTAATACTAGGAGCCCCTGATATAGCCTTCCCCCGAATAAATAATATAAGATTTTGACTTTTACCTCCTTCTCTAATTCTTCTTTTATCAAGGTCTATTATNGAAAATGGAGCA---------GGAACTGGCTGAACAGTTTACCCTCCCCTTTCTTNTAATATTTCCCATGCTGGAGCTTCTGTAGATCTTGCAATCTTTTCCCTACACCTAGCAGGTATTTCCTCAATCCTAGGGGCAGTTAAT------TTTATCACAACCGTAATTAACATACGCTCTAGAGGAATTACATTTGATCGAATGCCTTTATTTGTATGATCTGTATTAATTACAGCTATTCTTCTACTACTCTCCCTCCCAGTATTAGCAGGGGCTATTACAATACTACTCACAGACCGAAATTTAAAT-----------------------------------

Here is the python script that I wrote to do this:

from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys

#Name of the datafile
Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"

#Name of the original sequence file
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

#Name of the output sequence file
f4 = open("02_Arthropoda_specimen_less.fasta", 'w')

#Reading the datafile and extracting record IDs   
TaxaKeep = []
with open(Taxonomyfile, 'r') as f1:
    datareader = csv.reader(f1, delimiter='\t')
    for item in datareader:
        TaxaKeep.append(item[0])
    print(len(TaxaKeep))    

#Filtering sequence file to keep only those sequences with the desired IDs
datareader = SeqIO.parse(OrigTaxonSeqsfile, "fasta")
for seq in datareader:
    for item in TaxaKeep:
        if item in seq.id:
            f4.write('>' + str(item) + '\n')
            f4.write(str(seq.seq) + '\n')

I think the trouble here is that I'm looping through the list of 1.7 million record names for each of the 4.8 million records. I thought about making a dictionary or something for the 4.8 million records, but I couldn't figure out how. Any suggestions (including non-python suggestions)?

Thanks!

Andreanna
  • Can you provide a sample output? Right now it is unclear what information you want to keep or discard from the two files, and what string format you'd like the retrieved information to be in. – dROOOze Mar 25 '18 at 09:20
  • When feeling adventurous, you can try CUDA python: https://developer.nvidia.com/how-to-cuda-python . The caveat is that you have to rethink your data structures and algorithms so that you can avoid race conditions. It looks to me like the second block of nested for loop could be done in parallel, but you probably don't want to write straight into a file. Rather, you can write them into some SET of dictionaries or lists. – TuanDT Mar 25 '18 at 09:40
  • I am writing a comprehensive answer, give me 10-20 minutes. Also, is it okay if I don't use Bio? – RinkyPinku Mar 25 '18 at 09:42
  • I really don't like the idea of working on data while the file is open. There's surely no impact but don't do it. – bobrobbob Mar 25 '18 at 09:44
  • In your 2nd file, are `ASHYE2081-10` and `BCISA145-10` the `seq.id` that you are checking against the ID's of the 1st file? – gboffi Mar 25 '18 at 10:09

3 Answers


I think you can get a big performance gain by improving the way you do the look-up.

Using a set() can help you with that. Sets are designed to do very fast data look-up and they don't store duplicate values, which makes them an ideal choice for filtering data. So let's store all the taxonomy IDs from the input file in a set.

from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys

taxonomy_file = "02_Arthropoda_specimen_data_less.txt"
orig_taxon_sequence_file = "00_Arthropoda_specimen.fasta"
output_sequence_file = "02_Arthropoda_specimen_less.fasta"

# build a set for fast look-up of IDs
with open(taxonomy_file, 'r', newline='') as fp:
    datareader = csv.reader(fp, delimiter='\t')
    first_column = (row[0] for row in datareader)
    taxonomy_ids = set(first_column)

# use the set to speed up filtering the input FASTA file
with open(output_sequence_file, 'w') as fp:
    for seq in SeqIO.parse(orig_taxon_sequence_file, "fasta"):
        if seq.id in taxonomy_ids:
            fp.write('>')
            fp.write(seq.id)
            fp.write('\n')
            fp.write(str(seq.seq))
            fp.write('\n')
  • I've renamed a few of your variables. It's completely pointless to name a variable f4 only to write "#Name of the output sequence file" in a comment above. Why not get rid of the comment and name the variable output_sequence_file directly?
  • (row[0] for row in datareader) is a generator expression. A generator is an iterable object, which means this does not build the list of IDs just yet - it just knows what to do. This saves time and memory by not building a temporary list. One line later, the set() constructor, which accepts iterables, builds a set out of all the IDs in the first column.
  • In the second block, we use if seq.id in taxonomy_ids to check if a sequence ID should be output. in is very fast on sets.
  • I'm calling .write() several times rather than building a temporary string out of the pieces. seq.id is already a string, but seq.seq is a Seq object, so str() converts it before writing; note the extra newline between the ID and the sequence, matching the desired output.
  • I don't know a lot about the FASTA file format, but a quick look at the BioPython documentation suggests that using SeqIO.write() would be the better way of creating the format.
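For illustration, here is a minimal sketch of that SeqIO.write() variant, assuming the same variable names as above. SeqIO.write() accepts any iterable of records, so the filter can be expressed as a generator expression:

from Bio import SeqIO

# sketch: let SeqIO.write handle the FASTA formatting instead of writing lines by hand
records = SeqIO.parse(orig_taxon_sequence_file, "fasta")
wanted = (record for record in records if record.id in taxonomy_ids)
count = SeqIO.write(wanted, output_sequence_file, "fasta")
print(count, "records written")

Note that the stock "fasta" writer wraps sequences at 60 characters per line and keeps the full description in the header, so the output will not be byte-identical to the hand-written version above.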
Tomalak
  • your approach, storing ids in a set and then looking for seq.id in that set, is pretty witty. – RinkyPinku Mar 25 '18 at 12:20
  • ...and sets are implemented on top of dicts. They are effectively dicts with only keys and no values. – Tomalak Mar 25 '18 at 12:28
  • 1
    Thanks!! This finished in 5 min! In order to make it work though, I had to modify the sequence IDs using `cut -d\| -f1 00_Arthropoda_specimen.fasta > 00_Arthropoda_specimen_headers.fasta` That got rid of all of the text after the `|` so that the look up was successful – Andreanna Mar 25 '18 at 13:08
  • 1
    It's not necessary to use `cut` for that. You can do it directly in Python. (Hint, the quickest way to *"get the left part of a string value up to a certain character"* is to index into the string, like this `value[:value.index('|')]`) – Tomalak Mar 25 '18 at 13:21
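Following up on that comment thread: a minimal sketch of the filtering loop from the answer above, tweaked to strip everything after the first | in Python rather than with cut, assuming the same fp and taxonomy_ids variables and headers like >ASHYE2081-10|...|COI-5P as shown in the question:

for seq in SeqIO.parse(orig_taxon_sequence_file, "fasta"):
    # keep only the part of the header before the first '|' (the whole id if there is none)
    record_id = seq.id.split('|', 1)[0]
    if record_id in taxonomy_ids:
        fp.write('>')
        fp.write(record_id)
        fp.write('\n')
        fp.write(str(seq.seq))
        fp.write('\n')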

You are correct in reasoning that with two nested for loops you end up paying the cost of a single look-up 4.8 million * 1.7 million times.

That is why we will use a SQLite database to store all the information contained in OrigTaxonSeqsfile. Why SQLite? Because

  • SQLite comes inbuilt with Python
  • SQLite supports indexing

I won't go into CS theory here, but indexes are a godsend when it comes to searching data in cases like yours.

Once the data is indexed, you just look up each record id from Taxonomyfile in the database and write it to your final output file f4.

The following code should work just as you want; it has the following advantages:

  • shows you the progress made, in number of lines processed
  • only needs Python 3; the Bio library is not strictly needed
  • uses generators so no file has to be read into memory all at once
  • doesn't depend on list/set/dict, because in this particular case they may consume too much RAM

Here's the code

import sqlite3
from itertools import groupby
from contextlib import contextmanager

Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

@contextmanager
def create_db(file_name):
    """ create SQLite db, works as context manager so file is closed safely"""
    conn = sqlite3.connect(file_name, isolation_level="IMMEDIATE")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE taxonomy
        ( _id INTEGER PRIMARY KEY AUTOINCREMENT
        , record_id TEXT NOT NULL
        , record_extras TEXT
        , dna_sequence TEXT
        );
        CREATE INDEX idx_taxn_recID ON taxonomy (record_id);
    """)
    yield cur
    conn.commit()
    conn.close()
    return

def parse_fasta(file_like):
    """ generator that yields a tuple containing the record id, the extra info
    in the tail of the header and the DNA sequence without newline characters
    """
    # inspiration = https://www.biostars.org/p/710/
    try:
        from Bio import SeqIO
    except ImportError:
        fa_iter = (x[1] for x in groupby(file_like, lambda line: line[0] == ">"))
        for header in fa_iter:
            # remove the >
            info = next(header)[1:].strip()
            # separate record id from rest of the header info
            x = info.split('|')
            recID, recExtras = x[0], "|".join(x[1:])
            # build the DNA seq by joining the lines of the next group
            sequence = "".join(s.strip() for s in next(fa_iter))
            yield recID, recExtras, sequence
    else:
        fasta_sequences = SeqIO.parse(file_like, 'fasta')
        for fasta in fasta_sequences:
            info, sequence = fasta.id, str(fasta.seq)
            # separate record id from rest of the header info
            x = info.split('|')
            recID, recExtras = x[0], "|".join(x[1:])
            yield recID, recExtras, sequence
    return

def prepare_data(fasta_file, db_file):
    """ put data from fasta_file into db_file, building an index on record id """
    i = 0
    src_gen = open(fasta_file, mode='rt')
    fasta_gen = parse_fasta(src_gen)
    with create_db(db_file) as db:
        for recID, recExtras, dna_seq in fasta_gen:
            db.execute("""
                INSERT INTO taxonomy
                (record_id, record_extras, dna_sequence) VALUES (?,?,?)
                """,
                [recID, recExtras, dna_seq]
            )
            i += 1
            if i % 100 == 0:
                print(i, 'records digested into sql database')
    src_gen.close()
    return

def get_DNA_seq_of(recordID, src):
    """ search for recordID in src database and return a formatted string """
    ans = ""
    exn = src.execute("SELECT * FROM taxonomy WHERE record_id=?", [recordID])
    for match in exn.fetchall():
        a, b, c, dna_seq = match
        ans += ">%s\n%s\n" % (recordID, dna_seq)
    return ans

def main():
    # first of all prepare an optimized database
    db_file = OrigTaxonSeqsfile + ".sqlite"
    prepare_data(OrigTaxonSeqsfile, db_file)
    # now start searching and writing
    progress = 0
    db = sqlite3.connect(db_file)
    cur = db.cursor()
    out_file = open("02_Arthropoda_specimen_less.fasta", 'wt')
    taxa_file = open(Taxonomyfile, 'rt')
    with taxa_file, out_file:
        for line in taxa_file:
            question = line.split("\t")[0]
            answer = get_DNA_seq_of(question, cur)
            out_file.write(answer)
            progress += 1
            if progress % 100 == 0:
                print(progress, 'lines processed')
    db.close()

if __name__ == '__main__':
    main()

Feel free to ask for any clarifications.
If you get any errors or if the output is not as expected, please send me a 200-line sample of each of Taxonomyfile and OrigTaxonSeqsfile and I'll update the code.


Speed gains

The following is a rough estimate, considering only disk I/O, since that is the slowest part.

Let a = 4.8 million and b = 1.7 million.

In the old approach you would have to perform a * b disk I/O operations, i.e. 8160 billion of them.

In my approach, once you do the indexing (roughly 2 * a operations), you only have to look up the 1.7 million record IDs. So the total comes to roughly 2 * (a + b), i.e. 13 million disk I/O operations, which is not small either, but this approach is more than 600 thousand times faster.
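To make the back-of-the-envelope numbers concrete (a rough sketch, counting one I/O per look-up):

a = 4_800_000                    # sequences in the FASTA file
b = 1_700_000                    # IDs in the taxonomy file
print(a * b)                     # 8_160_000_000_000 -> ~8160 billion look-ups with nested loops
print(2 * (a + b))               # 13_000_000        -> ~13 million operations with the index
print(a * b // (2 * (a + b)))    # 627_692           -> roughly a 600-thousand-fold speed-up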

Why not dict()?

I get scolded by my boss and professor if I am found using too much CPU/RAM. If you own the system, a simpler dict-based approach is:

from itertools import groupby

Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

def parse_fasta(file_like):
    """ generator that yields a tuple containing the record id, the extra info
    in the tail of the header and the DNA sequence without newline characters
    """
    from Bio import SeqIO
    fasta_sequences = SeqIO.parse(file_like, 'fasta')
    for fasta in fasta_sequences:
        info, sequence = fasta.id, str(fasta.seq)
        # separate record id from rest of the header info
        x = info.split('|')
        recID, recExtras = x[0], x[1:]
        yield recID, recExtras, sequence
    return

def prepare_data(fasta_file, dct):
    """ put data from fasta_file into the dct dictionary """
    i = 0
    with open(fasta_file, mode='rt') as src_gen:
        fasta_gen = parse_fasta(src_gen)
        for recID, recExtras, dna_seq in fasta_gen:
            dct[recID] = dna_seq
            i += 1
            if i % 100 == 0:
                print(i, 'records loaded into the dict')
    return

def get_DNA_seq_of(recordID, src):
    """ look up recordID in the src dict and return a formatted string """
    # .get() avoids a KeyError when an ID from the taxonomy file has no sequence
    dna_seq = src.get(recordID)
    if dna_seq is None:
        return ""
    return ">%s\n%s\n" % (recordID, dna_seq)

def main():
    # first of all load the sequences into a dict
    dct = dict()
    prepare_data(OrigTaxonSeqsfile, dct)
    # now start searching and writing
    progress = 0
    out_file = open("02_Arthropoda_specimen_less.fasta", 'wt')
    taxa_file = open(Taxonomyfile, 'rt')
    with taxa_file, out_file:
        for line in taxa_file:
            question = line.split("\t")[0]
            answer = get_DNA_seq_of(question, dct)
            out_file.write(answer)
            progress += 1
            if progress % 100 == 0:
                print(progress, 'lines processed')
    return

if __name__ == '__main__':
    main()
RinkyPinku
  • @Andreanna to understand code start reading from main() function and go backwards. – RinkyPinku Mar 25 '18 at 11:59
  • *"no csv or Bio libraries needed"* is not actually an improvement. Discarding the tried-and-tested parsers for the input file formats and rolling your own parsing code is nonsense. Writing all that data to a database first because "it supports indexing" is also unnecessary overhead. dicts and sets support "indexing" as well (in fact, they are based on exactly the same thing as a database index). – Tomalak Mar 25 '18 at 11:59
  • @Tomalak made some corrections, thanks. I knew rolling my own parser is a bad idea, but I think I became oversmart. – RinkyPinku Mar 25 '18 at 12:06
  • *"keys in a dict are not indexed"* That is completely, fundamentally wrong. Keys in a dict **are an index**. It's the whole point of dicts to have super fast key lookup. – Tomalak Mar 25 '18 at 12:18
  • Compare: https://docs.python.org/3.6/faq/design.html#how-are-dictionaries-implemented - dict lookups are O(1) for the vast majority of cases. That's faster than your SQL table approach can do it (SQL indexes work exactly like dicts, so the lookup itself is roughly O(1) there as well, but you add the whole database interaction part to each lookup operation.) – Tomalak Mar 25 '18 at 12:23
  • @Tomalak Of course dict lookups are a lot faster, but there are obvious drawbacks, about which one could care or not, depending on the situation. One is memory usage, which in case of very large files might not fit, for instance. The other is that if you have to access the lookup in multiple runs of the script, you'll have to re-read all the lookup records, while the database will still be there. Again, this might not be an issue because the data is small enough, but it's worth pointing out. – ChatterOne Mar 25 '18 at 18:34
  • @ChatterOne Memory usage is an issue at some point, yes, but assuming 2 million keys with an average length of 11 characters, like the OP shows, we are at less than 70 MB (`sys.getsizeof("ANICH889-10") * 2000000 / 1024 / 1024`), so even with the overhead of the dict/set itself on top of that, we are not remotely in a worrying area. If you have to access the dict in multiple runs of the script, some sort of serialization would be nice, that's true. If re-building the dict from the CSV is indeed prohibitive, I'd try using pickle first, because the DB will be slower than a dict. – Tomalak Mar 25 '18 at 18:45
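Expanding on that last comment, here is a minimal sketch of caching the look-up structure with pickle between runs (the cache file name is hypothetical):

import os
import pickle

CACHE = "taxonomy_ids.pickle"   # hypothetical cache file

if os.path.exists(CACHE):
    # reuse the set built by a previous run
    with open(CACHE, 'rb') as fh:
        taxonomy_ids = pickle.load(fh)
else:
    # build the set from the taxonomy file once, then cache it
    with open("02_Arthropoda_specimen_data_less.txt") as fh:
        taxonomy_ids = {line.split('\t')[0] for line in fh}
    with open(CACHE, 'wb') as fh:
        pickle.dump(taxonomy_ids, fh)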

I have asked for clarification in a comment under your question but for now you are unresponsive (no criticism intended), so I will try to answer your question before I have to go, basing my code on the following assumptions.

  1. In the 2nd data file, every record takes two lines, the first line being a header of sorts and the second being the ACGT sequence.
  2. In the header line we have a prefix ">", then some fields separated by "|", and the first of these fields is the ID of the whole two-line record.

Under the above assumptions

# If possible, no hardcoded filenames, use sys.argv and the command line
import sys

# command line sanity check
if len(sys.argv) != 4:
     print('A descriptive error message')
     sys.exit(1)

# Names of the input and output files
fn1, fn2, fn3 = sys.argv[1:]

# Use a set comprehension to load the IDs from the first file
IDs = {line.split()[0] for line in open(fn1)} # a set

# Operate on the second file
with open(fn2) as f2:

    # It is possible to use `for line in f2: ...` but here we have(?)
    # two line records, so it's a bit different
    while True:

        # Try to read two lines from file
        try:
            header = next(f2)
            payload = next(f2)
        # no more lines? break out from the while loop...
        except StopIteration:
            break

        # Sanity check on the header line
        if header[0] != ">":
            print('Incorrect header line: "%s".'%header)
            sys.exit(1)

        # Split the header line on "|", find the current ID
        ID = header[1:].split("|")[0]

        # Check if the current ID was mentioned in the first file
        if ID in IDs:
            # your code

Because there is no inner loop, this should be 6 orders of magnitude faster... it remains to be seen if it does what you need :-)
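For completeness, with the file names from the question the script above would be invoked with the three file names on the command line, for example (the script name filter_sequences.py is just a placeholder):

python filter_sequences.py 02_Arthropoda_specimen_data_less.txt 00_Arthropoda_specimen.fasta 02_Arthropoda_specimen_less.fasta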

gboffi