I am processing the human genome and have ~10 million SNPs (each identified by a "SNP_ID") for a single patient. I have two reference TSVs in which each row contains a SNP_ID and a floating-point number (along with a lot of other metadata), all in ASCII format. Each reference TSV is 300-500 GB in size.
I need to filter the 10 million SNPs based on criteria contained in the TSVs. In other words: find the row with the matching SNP_ID, look up the floating-point number, and decide whether the value is above a threshold.
My thought is to store the SNP_IDs in a Python set, then scan over each TSV, checking whether each row's SNP_ID is in the set. Do you think this is a reasonable approach, or will lookups in a set with 10 million items be very slow? I have hundreds of patients this needs to be done for, so it shouldn't take more than an hour or two to process.
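For concreteness, here is a minimal sketch of the approach I have in mind. The column names (`SNP_ID`, `score`) and the threshold are placeholders for whatever the real schema uses; the TSV is streamed line by line so the 300-500 GB file never has to fit in memory:

```python
import csv

def filter_snps(snp_ids, tsv_lines, threshold,
                id_col="SNP_ID", score_col="score"):
    """Stream TSV rows; return the patient's SNP IDs whose score exceeds threshold.

    snp_ids:   set of the patient's SNP identifiers (set membership is
               O(1) on average, so ~10M entries should be fine).
    tsv_lines: any iterable of raw TSV lines, e.g. an open file handle,
               so the huge reference file is read one row at a time.
    Column names here are assumptions; adjust to the real schema.
    """
    reader = csv.reader(tsv_lines, delimiter="\t")
    header = next(reader)
    id_idx = header.index(id_col)
    score_idx = header.index(score_col)
    passing = set()
    for row in reader:
        if row[id_idx] in snp_ids and float(row[score_idx]) > threshold:
            passing.add(row[id_idx])
    return passing

# Usage (hypothetical file name and threshold):
# with open("reference1.tsv", newline="") as f:
#     keep = filter_snps(patient_snps, f, 0.5)
```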