I currently have a list of 300,000+ FASTQ identifiers that I use to filter records out of a FASTQ file.
My file structure is currently set up like this:
@[FASTQ identifier] [random text]
[DNA sequence]
+
[DNA sequence quality score]
This 4-line structure is repeated throughout the file. My current script extracts the FASTQ identifier from each record in the FASTQ file and checks whether it exists in the list of FASTQ identifiers. If it does, the whole record is written to the output file. However, parsing these files is very slow (especially if the list contains 1E6+ identifiers or the FASTQ file is particularly large). Is there a way to make my script process the FASTQ file faster?
Here's my section of the code that does the parsing:
with open(input_r1_file, 'r') as input_file:
    while True:
        # Each FASTQ record spans four consecutive lines
        title = input_file.readline()
        sequence = input_file.readline()
        extra = input_file.readline()
        quality = input_file.readline()
        # readline() returns an empty string at end of file
        if len(title) == 0:
            break
        # Identifier = first space-delimited token, without the leading '@'
        input_identifier = title.split(' ')[0][1:]
        if input_identifier in alpha_identifier_list:
            output_file_r1a.write(title)
            output_file_r1a.write(sequence)
            output_file_r1a.write(extra)
            output_file_r1a.write(quality)
            alpha_identifier_list.remove(input_identifier)
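
One direction that might be worth trying, shown only as a rough sketch: `in` and `.remove()` on a Python list both scan the list element by element, so each record costs time proportional to the number of identifiers, whereas a `set` does the membership test by hashing. The function name `filter_fastq` and its parameters are hypothetical and not part of my script. The sketch assumes each identifier appears at most once in the FASTQ file, so it does not remove matched identifiers the way the loop above does.

```python
import io


def filter_fastq(input_handle, output_handle, wanted_ids):
    """Copy four-line FASTQ records whose identifier is in wanted_ids.

    Hypothetical helper for illustration; returns the number of records written.
    """
    # Build the set once; membership tests are then O(1) on average
    wanted = set(wanted_ids)
    written = 0
    while True:
        title = input_handle.readline()
        if not title:
            # Empty string means end of file
            break
        sequence = input_handle.readline()
        extra = input_handle.readline()
        quality = input_handle.readline()
        # Text before the first space, minus the leading '@' and any newline
        identifier = title.partition(' ')[0][1:].rstrip()
        if identifier in wanted:
            output_handle.writelines((title, sequence, extra, quality))
            written += 1
    return written


if __name__ == "__main__":
    # Tiny in-memory example standing in for real files
    demo_in = io.StringIO("@id1 run1\nACGT\n+\nIIII\n@id2 run1\nTTTT\n+\nFFFF\n")
    demo_out = io.StringIO()
    print(filter_fastq(demo_in, demo_out, ["id2"]))
    print(demo_out.getvalue(), end="")
```

With real files, the two handles would be the objects returned by `open(input_r1_file, 'r')` and the already-open `output_file_r1a`, and `wanted_ids` would be `alpha_identifier_list`.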