1

I currently have a list of 300,000+ FASTQ identifier codes that are used to parse a file.

My file structure is currently set up like this:

@[FASTQ identifier] [random text]

[DNA sequence]

+

[DNA sequence quality score]

This 4 line structure is repeated throughout the file. The way my current script is set up is that I extract the FASTQ identifier from the FASTQ file and see if it exists in the list of FASTQ identifiers. If it does, then it writes it to the output file. However, The time it takes to parse these files is very slow (especially if the lists contains 1E6+ identifiers or the FASTQ file is particularly large). Is there a way to make my script process the FASTQ file faster?

Here's my section of the code that does the parsing:

with open (input_r1_file,'r') as input_file:
while True:
    title = input_file.readline()
    sequence = input_file.readline()
    extra = input_file.readline()
    quality = input_file.readline()

    input_identifier = title.split(' ')[0][1:]
    if input_identifier in alpha_identifier_list:
        output_file_r1a.write(title)
        output_file_r1a.write(sequence)
        output_file_r1a.write(extra)
        output_file_r1a.write(quality)
        alpha_identifier.remove(input_identifier)
    else:
        pass
    if len(title) == 0:
        break
superasiantomtom95
  • 521
  • 1
  • 7
  • 25
  • Use a different language? – Scott Hunter Jan 18 '18 at 23:40
  • See [this question](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) about profiling your code. There's a lot of file I/O here that is run sequentially. You could write the output file in a different thread, or store it in a different volume, or in memory until you're done. – import random Jan 18 '18 at 23:51
  • Is the 1e6 number of identifiers a static list? If so, using a database would probably help. – tripleee Jan 19 '18 at 06:19
  • 3
    Unless there is specific need to develop your own implementation, I would recommend using existing tools to parse large datasets, eg. https://academic.oup.com/bioinformatics/article/32/12/1883/1744334 – Vince Jan 19 '18 at 19:01
  • You might not know about https://bioinformatics.stackexchange.com/. – Bill Bell Jan 19 '18 at 21:49
  • If your identifier is not in the list, just skip the sequence and quality lines with `next(input_file)`. – Gregor Sturm Jan 20 '18 at 18:26
  • I suspect that the main bottleneck is looking up items in very large lists (```in alpha_identifier_list``` and ```alpha_identifier.remove()```. Converting those lists to sets would probably make a much bigger difference then fiddling around with I/O. – heathobrien Jan 22 '18 at 09:04

0 Answers0