
I am processing the human genome and have ~10 million SNPs (identified by a "SNP_ID") for a single patient. I have two reference TSVs; each row contains a SNP_ID and a floating point number (along with lots of other metadata), all in ASCII format. These reference TSVs are 300-500 GB in size.

I need to filter the 10 million SNPs based on criteria contained within the TSVs. In other words: find the row with the SNP_ID, look up the floating point number, and decide whether the value is above a threshold.

My thought is to store the SNPs in a Python set, then do a scan over each TSV, checking whether each row's SNP_ID matches any item in the set. Do you think this is a reasonable approach, or will lookups in a set with 10 million items be very slow? This needs to be done for hundreds of patients, so it shouldn't take more than an hour or two to process.
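Here is a rough sketch, with synthetic SNP IDs standing in for the real ones, of the lookup pattern I'm worried about:

```python
# Rough sketch with synthetic SNP IDs -- the sizes match the real problem,
# but the IDs themselves are made up.
import time

snp_ids = {f"rs{i}" for i in range(10_000_000)}  # ~10 million entries

start = time.perf_counter()
hits = sum(f"rs{i}" in snp_ids for i in range(0, 20_000_000, 20))  # 1M lookups
elapsed = time.perf_counter() - start
print(f"{hits:,} hits, {elapsed:.2f}s for 1,000,000 set lookups")
```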

user491880
  • Given that strings are pretty big, floats are small, and most of the data is redundant, the TSVs might actually be quite small when brought into memory. Then you can do an ordinary join with a package like `pandas` – BallpointBen Sep 05 '19 at 01:59
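A minimal sketch of the join idea from the comment above, assuming the reference columns are literally named SNP_ID and value and that the patient file holds one SNP_ID per line (both are placeholders for the real layout):

```python
# Sketch only: column names, file names, and the threshold are assumptions.
import pandas as pd

patient = pd.read_csv("patient_snps.txt", names=["SNP_ID"])  # one ID per line
THRESHOLD = 0.5  # placeholder

kept = []
# Read just the two needed columns, in chunks, so the 300-500GB file
# never has to fit in memory all at once.
for chunk in pd.read_csv("reference.tsv", sep="\t",
                         usecols=["SNP_ID", "value"],
                         chunksize=5_000_000):
    merged = chunk.merge(patient, on="SNP_ID", how="inner")
    kept.append(merged[merged["value"] >= THRESHOLD])

result = pd.concat(kept, ignore_index=True)
```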

2 Answers


Your data size is large enough that you should not be working with data structures in memory. Instead, consider using a relational database system. You can start with sqlite, which comes bundled with Python.

This SO answer has details about how to load a TSV into sqlite.

After your set of SNPs and your reference TSVs are in sqlite, you can filter the SNPs with a simple SQL query such as:

SELECT
    t1.SNP_ID
FROM
    snps t1
LEFT JOIN
    ref_tsv t2
ON
    t1.SNP_ID = t2.SNP_ID
WHERE
    t2.value >= {your_threshold}
;
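For example, a rough end-to-end sketch from Python, assuming tab-delimited files with SNP_ID and value header columns (file names, column names, and the threshold are placeholders):

```python
# Sketch only: file names, column names, and the threshold are placeholders.
import csv
import sqlite3

con = sqlite3.connect("snps.db")
con.execute("CREATE TABLE IF NOT EXISTS snps (SNP_ID TEXT PRIMARY KEY)")
con.execute("CREATE TABLE IF NOT EXISTS ref_tsv (SNP_ID TEXT, value REAL)")

# Load the patient's SNP_IDs (assumed: one ID per line).
with open("patient_snps.txt") as f:
    con.executemany("INSERT OR IGNORE INTO snps VALUES (?)",
                    ((line.strip(),) for line in f if line.strip()))

# Stream the reference TSV, storing only the two columns of interest.
with open("reference.tsv", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    con.executemany("INSERT INTO ref_tsv VALUES (?, ?)",
                    ((row["SNP_ID"], float(row["value"])) for row in reader))

# An index on the join key keeps the lookup fast.
con.execute("CREATE INDEX IF NOT EXISTS idx_ref ON ref_tsv (SNP_ID)")
con.commit()

THRESHOLD = 0.5
passing = con.execute(
    "SELECT t1.SNP_ID FROM snps t1 "
    "JOIN ref_tsv t2 ON t1.SNP_ID = t2.SNP_ID "
    "WHERE t2.value >= ?", (THRESHOLD,)).fetchall()
```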
foglerit

OK, here's what I would do in your case.

  1. 500 GB of metadata is a lot; let's look at how we can reduce that amount.
  2. Your idea to make a set() of SNP_IDs is good. Read all your SNP data and build a set of SNP_IDs; it will definitely fit into memory.
  3. Then read the TSV data: for every row, check whether the SNP_ID is in your set; if it is, save the SNP_ID and the floating point number and discard the rest. You will have 10M records at most, because one patient only has that many SNPs (see the sketch after this list).
  4. Do your magic.
  5. Start over with the next patient.
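A minimal sketch of steps 2 and 3, assuming one SNP_ID per line in the patient file and that the SNP_ID and the float sit in the first two TSV columns (adjust the indices and file names to the real layout):

```python
# Sketch only: file names and column positions are assumptions.
SNP_ID_COL = 0   # column holding the SNP_ID in the reference TSV
VALUE_COL = 1    # column holding the floating point number

# Step 2: build the in-memory set of this patient's SNP_IDs.
with open("patient_snps.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

# Step 3: stream the huge TSV, keeping only the ID and float of matching rows.
kept = {}
with open("reference.tsv") as f:
    # next(f)  # uncomment to skip a header line, if the TSV has one
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if fields[SNP_ID_COL] in wanted:
            kept[fields[SNP_ID_COL]] = float(fields[VALUE_COL])

# Step 4: the "magic", e.g. applying the threshold.
THRESHOLD = 0.5  # placeholder
passing = {snp for snp, value in kept.items() if value >= THRESHOLD}
```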

It would be nice to put all the data on a fast SSD just in case.

And, something else to try: if you discard the metadata and keep only the SNP_ID and the float, you may be able to reduce the TSVs to just a few gigabytes. Then you could easily fit them into memory and make things much faster (see the sketch below).
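For instance, a one-off pass like this (again assuming the two interesting columns come first) would shrink each reference file down to just the SNP_ID/float pairs:

```python
# One-off preprocessing sketch: keep only SNP_ID + float, drop the metadata.
# Column positions and file names are assumptions.
with open("reference.tsv") as src, open("reference_slim.tsv", "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("\t")
        dst.write(f"{fields[0]}\t{fields[1]}\n")
```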

lenik