
I have two ~59GB text files in ".fastq" format. FASTQ files are genomic read files produced by a sequencer. Every four lines make up one read, but the lines vary in length.

Each file is roughly 59GB and holds about 211M reads, which means approximately 211M*4 = 844M lines. The program I'm using, Bowtie, currently supports the following options:

"--skip 105M --qupto 105M"

which essentially means "skip the first 105M reads and only process up to the next 105M reads." In this way you can break up processing of the file. The problem is, the way that it does the skipping is incredibly slow. It just reads the first 105M reads as it normally would, but doesn't process them. Then it starts comparisons once it gets to the read value it was given.

I am wondering if I can use something like C/C++'s fsetpos to set the position to the middle of the file [or wherever]. I realize this will probably put me somewhere in the middle of a line, so from there I would find the beginning of the first full read and start processing, rather than waiting for it to read approximately 422M lines before it gets where it needs to go. Does anybody have experience using fsetpos on such a large file, and know whether the performance is any better than the current approach?

Thanks-- Nick

HodorTheCoder
  • How would you know what line you were on? Open up a random book, find an arbitrary letter on one of the pages. How many sentences come before that letter? – Joe Sep 27 '12 at 19:18
  • Could you use something instead to preprocess the file into multiple files, by doing so in a single pass? Not sure how much it would help, but it would save extra double-reading of skipped lines if you had more than 2 chunks. – Joe Sep 27 '12 at 19:20
  • Why don't you just try it and see how it works? – 001 Sep 27 '12 at 19:21
  • The nature of the fastq file is that every four-line snippet/read has a format that would allow me to figure out where I was. In other words, it's like opening a book to a random arbitrary letter on one of the pages and then backing up until you reach the beginning of the sentence, only in the case of the fastq files, the reads are numbered. So it'd be like if every sentence in the book started with "+[3220243]" or something indicating that is the sentence number. – HodorTheCoder Sep 27 '12 at 19:37

1 Answer

Yes, you can position to the middle of a file using C++.

For huge files, seeking directly to a position is usually much faster than reading and discarding all the data that precedes it.
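As a rough illustration of positioning, here is a minimal sketch using a standard `std::ifstream` and `seekg`. The helper name `line_at_offset` is hypothetical, and the sketch assumes a platform where `std::streamoff` is 64 bits wide (typical on current 64-bit systems), which is what allows offsets into a ~59GB file. Because the seek can land mid-line, the first line returned may be only the tail of a line.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: open `path`, jump straight to byte `offset`
// without reading anything before it, and return the rest of the line
// found there (which may be a partial line).
std::string line_at_offset(const std::string& path, std::uint64_t offset) {
    // Binary mode so that offsets are plain byte counts.
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return {};
    }
    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);

    std::string line;
    std::getline(in, line);
    return line;
}
```

Calling `line_at_offset("reads.fastq", file_size / 2)` (both names are placeholders) would return whatever text sits between the midpoint of the file and the next newline.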

In general, the process for positioning within a file:

  1. A request is made to read the directory entry for the file.
  2. The directory is searched to find the track and sector for the file position. (Some filesystems use directory extensions for large files, so more data may need to be read.)
  3. On the next read, the hard drive is told to go to the given track and sector, and then reads in the data.

You save the time it would take for all the preceding data to pass through the communications port and into memory, only to be ignored.
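After seeking to an arbitrary byte, the reader still has to find where the next complete read begins. The sketch below is one possible way to do that, not Bowtie's method. It assumes the common four-line FASTQ layout, which this thread does not spell out: a header line starting with `@`, one sequence line, a separator line starting with `+`, and one quality line. Since a quality line can also begin with `@`, the sketch only accepts an `@` line whose second following line begins with `+`. The function name `next_record_offset` is hypothetical.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: return the byte offset of the first FASTQ record
// that starts at or after `offset`, or -1 if none is found.
// Assumes unwrapped, four-line records: '@' header, sequence, '+', quality.
std::int64_t next_record_offset(const std::string& path, std::int64_t offset) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return -1;
    }

    std::string line;
    if (offset > 0) {
        // Start one byte early and discard through the next newline.
        // If byte offset-1 is '\n', this leaves us exactly at `offset`.
        in.seekg(static_cast<std::streamoff>(offset - 1), std::ios::beg);
        std::getline(in, line);
    }

    // Ring buffer holding the start offset and first character
    // of the three most recent lines.
    std::streamoff pos[3] = {0, 0, 0};
    char first[3] = {'\0', '\0', '\0'};

    for (int n = 0;; ++n) {
        std::streamoff start = in.tellg();
        if (!std::getline(in, line)) {
            return -1;  // reached end of file without finding a record
        }
        pos[n % 3] = start;
        first[n % 3] = line.empty() ? '\0' : line[0];

        // The line two back is a header if it starts with '@' and the
        // current line (two lines later) is the '+' separator.
        int cand = (n + 1) % 3;  // same slot as (n - 2) % 3
        if (n >= 2 && first[cand] == '@' && first[n % 3] == '+') {
            return static_cast<std::int64_t>(pos[cand]);
        }
    }
}
```

A worker handling the second half of a file could call this with `offset = file_size / 2` and begin parsing at the returned position, while the worker for the first half stops at that same position.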

Thomas Matthews
  • OK-- this is good. [I am doing this over NFS as well, but it still has to be faster than reading sequential lines.] Is what you just described essentially what happens when one uses fgetpos()? I will google "positioning within a file linux". Thanks. – HodorTheCoder Sep 27 '12 at 23:00
  • @NickLindberg: If you think it is good, then click the checkmark next to this. :-) – Thomas Matthews Sep 27 '12 at 23:20
  • Good call. Could you provide an example of your methodology? I wasn't so much wondering how this would work in an ideal scenario [I figured it would work as you described]; an example code snippet or library/system call is what I was hoping for. fgetpos() is where I'm headed, but I'm not entirely sure it behaves in the fashion you describe above. – HodorTheCoder Sep 28 '12 at 03:32
  • @NickLindberg: This is in general. A good place to start would be the GNU Compiler source code and the Linux kernel. Also search for hard drive composition. Use either `ftell` or `fgetpos`. – Thomas Matthews Sep 28 '12 at 15:22
  • Oops. I mean fsetpos, not fgetpos. – HodorTheCoder Oct 01 '12 at 17:46