I’m reading large files containing DNA sequences. These are long stretches of characters, and I need a certain subset from somewhere in the file (I have the start and stop positions). Since the files are so big, I read them with a BufferedReader. This reads one line at a time, but the subset I want might span more than one line. The start and stop positions only make sense if the whole DNA sequence is treated as one line with no newlines. In practice, unfortunately, these files do contain newlines. So on every line the indices run from 0 to the end of that line, rather than continuing across the file as 0 to 20, 21 to 40, 41 to 60, etc.

My problem/question: Reading a file line by line, but saving a subset/substring that might be across multiple lines. I tried several methods but can’t extract the substring I want. I suspect my own logic/thinking is flawed, or there is a method I am not yet aware of. Is there a better way to do this?

Method 1:

public String getSubSequence() {


        fileLocation = "genome.fna";
        String referenceGenomeSub = "";
        int passedLetters = 0;
        int passedLines = 0;

        //start- and stop position
        int start = 50;
        int stop = 245;

        Path path = Paths.get(fileLocation);



        try (BufferedReader br = Files.newBufferedReader(path, Charset.defaultCharset())){

            String line;

            while ((line = br.readLine()) != null) {

                if (!line.startsWith(">")) {//Don't need lines that start with >

                    passedLines++;

                    //edit indices so I don't get out of bounds
                    if (passedLines != 1) {
                        start = start - passedLetters;
                        stop = stop - passedLetters;
                    }


                    //this is to know where I am in the file
                    passedLetters = passedLetters + line.length();


                    //if the subset is on only one line
                    if (start < passedLetters && stop <= passedLetters) {                        
                        referenceGenomeSub = referenceGenomeSub.concat(line);                        
                    }


                    //subsequence is on multiple lines
                    else if (start <= passedLetters && stop > passedLetters) {
                        referenceGenomeSub = line.substring(start);
                    }
                    else if (passedLetters > stop && !referenceGenomeSub.isEmpty()) {
                        referenceGenomeSub = referenceGenomeSub.concat(line.substring(0, stop));
                    }

                }

            }
            // no explicit br.close() needed: try-with-resources closes the reader

        } catch (IOException e) {
            System.out.println("Error: " + e.getMessage());
        }

        return referenceGenomeSub;
    }
}

Here I try to keep track of the number of characters I have already passed. This is how I know when I’m in range of the desired substring.
Result: StringIndexOutOfBoundsException

My other method is to save all lines up to and including the line with my stop position, and then extract a substring from that. This is not preferred, as my desired subset might be at the end of the file, which would mean holding almost the whole file in memory.

Conditions:
- Memory friendly
- No BioJava if possible. I’m still learning to program so I want to do this without. Even if it’s the hard way

I'm not looking for fixed code; an article or example that gets me on the right track is perfectly fine. I've been staring at my screen for hours without making progress, and my mind has gone a bit blank. As I said, the problem might be flawed thinking, or being oblivious to much better methods/techniques for this job.

Iarwain
  • The question is not a duplicate; however, the code does have a bug in that the !="" isn't the right way of comparing strings. – AlBlue May 06 '16 at 14:16
  • @AlBlue I added the most probable closing reason, but you could also close as "off-topic -> pleez fix my codez" et al. – Mena May 06 '16 at 14:25
  • @Mena In my defence, I'm not asking for fixed code. I'm perfectly happy with references to questions/articles that might get me on the right track (because I'm here to learn). I just sense my thinking/logic is not right and decided to share my code to give some insight in my thinking. While I appreciate your reference, my question is more about reading files – Iarwain May 06 '16 at 14:33
  • @Iarwain understood, and that's also why I haven't downvoted your question. In my opinion, it needs some clarification and summarizing work - maybe [this](http://stackoverflow.com/help/mcve) will help. – Mena May 06 '16 at 14:38
  • @Mena I edited the question to hopefully clarify my problem. Is this more like it? – Iarwain May 13 '16 at 09:00
  • @Iarwain I re-opened your question now. – Mena May 13 '16 at 09:24
  • Your `StringIndexOutOfBoundsException` is likely to take place here: `referenceGenomeSub = referenceGenomeSub.concat(line.substring(0, stop));`, and because `stop` is `>` `line.length()`. – Mena May 13 '16 at 09:26
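
One way to avoid adjusting `start` and `stop` on every line is to leave them untouched as global positions and instead track the global offset of the current line, then take only the part of each line that overlaps the requested range. The following is a minimal sketch of that idea, not a drop-in fix for the method above. It assumes 0-based positions with an exclusive `stop` (the question does not say which convention the positions use), and the class name `SubSequenceSketch`, the method `subSequence`, and the demo data are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class SubSequenceSketch {

    /**
     * Returns the characters in the half-open range [start, stop), counted over
     * all sequence lines as if they were joined without newlines.
     * Assumption: positions are 0-based and stop is exclusive.
     */
    static String subSequence(BufferedReader reader, int start, int stop) throws IOException {
        StringBuilder result = new StringBuilder(Math.max(0, stop - start));
        long offset = 0; // global index of the first character of the current line
        String line;

        // Stop reading as soon as the current line starts at or after stop.
        while (offset < stop && (line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                continue; // header line, not part of the sequence
            }
            long lineEnd = offset + line.length();

            // Overlap of [start, stop) with this line's global range [offset, lineEnd).
            long from = Math.max(start, offset);
            long to = Math.min(stop, lineEnd);
            if (from < to) {
                // Convert the global indices back to indices within this line.
                result.append(line, (int) (from - offset), (int) (to - offset));
            }
            offset = lineEnd;
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical three-line record; the joined sequence has 30 characters.
        String fasta = ">demo record\nACGTACGTAC\nGGGGCCCCTT\nAAAATTTTGG\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(fasta))) {
            System.out.println(subSequence(reader, 8, 23)); // prints ACGGGGCCCCTTAAA
        }
    }
}
```

Because the indices passed to `append` are always clamped to the current line, they can never exceed `line.length()`, and only the requested range is kept in memory. For a real file, the reader from `Files.newBufferedReader(path, ...)` can be passed in instead of the `StringReader`.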