I’m reading large files containing DNA sequences. These are long stretches of characters, and I need a certain subset from somewhere in the file (I have the start and stop positions). Since the files are so big, I use a BufferedReader to read them. This reads one line at a time, but the subset I want might span more than one line. The start and stop positions I have only make sense if the whole DNA sequence is treated as a single line with no newlines. In practice, unfortunately, these files do contain newlines. So on every line the indices run from 0 to the end of that line, instead of continuing across lines (0 to 20, 21 to 40, 41 to 60, etc.).
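To make the index mismatch concrete, here is a minimal sketch that prints which global positions each line covers. The three short lines and the names `OffsetDemo` and `globalRanges` are made up for illustration; real files have much longer lines.

```java
import java.util.List;

class OffsetDemo {
    // For each line, returns "globalStart-globalEndExclusive", i.e. the range of
    // positions that line occupies if all lines were joined without newlines.
    static String[] globalRanges(List<String> lines) {
        String[] ranges = new String[lines.size()];
        int offset = 0; // number of sequence characters before the current line
        for (int i = 0; i < lines.size(); i++) {
            int len = lines.get(i).length();
            ranges[i] = offset + "-" + (offset + len);
            offset += len;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical sequence lines, 10 + 10 + 4 characters.
        List<String> lines = List.of("ACGTACGTAC", "GGGTTTAAAC", "CCCA");
        for (String range : globalRanges(lines)) {
            System.out.println(range);
        }
    }
}
```

With these lines, a global position such as 12 falls on the second line at local index 2, which is the kind of translation the reader loop has to do.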
My problem/question: how do I read a file line by line while saving a substring that might span multiple lines? I tried several methods but can’t extract the substring I want. I suspect my own logic is flawed, or there is a method I am not yet aware of. Is there a better way to do this?
Method 1:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public String getSubSequence() {
    String fileLocation = "genome.fna";
    String referenceGenomeSub = "";
    int passedLetters = 0;
    int passedLines = 0;
    // start and stop position
    int start = 50;
    int stop = 245;
    Path path = Paths.get(fileLocation);
    // try-with-resources closes the reader, so no explicit br.close() is needed
    try (BufferedReader br = Files.newBufferedReader(path, Charset.defaultCharset())) {
        String line;
        while ((line = br.readLine()) != null) {
            if (!line.startsWith(">")) { // don't need lines that start with >
                passedLines++;
                // edit indices so I don't get out of bounds
                if (passedLines != 1) {
                    start = start - passedLetters;
                    stop = stop - passedLetters;
                }
                // this is to know where I am in the file
                passedLetters = passedLetters + line.length();
                // if the subset is on only one line
                if (start < passedLetters && stop <= passedLetters) {
                    referenceGenomeSub = referenceGenomeSub.concat(line);
                }
                // subsequence is on multiple lines
                else if (start <= passedLetters && stop > passedLetters) {
                    referenceGenomeSub = line.substring(start);
                }
                else if (passedLetters > stop && !referenceGenomeSub.isEmpty()) {
                    referenceGenomeSub = referenceGenomeSub.concat(line.substring(0, stop));
                }
            }
        }
    } catch (IOException e) {
        System.out.println("Error: " + e.getMessage());
    }
    return referenceGenomeSub;
}
Here I try to keep track of the number of characters I have already passed. This is how I know when I’m in range of the desired substring.
Result: StringIndexOutOfBoundsException
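The bookkeeping described above (count the characters already passed, then keep only the part of each line that lies inside the range) could be sketched as follows. This is a hedged illustration, not a verified fix for the method above: it assumes `start` is inclusive and `stop` is exclusive, both 0-based over the sequence with header lines and newlines removed, and the names `SubSequenceSketch` and `extract` are made up. Unlike Method 1, it never modifies `start` or `stop`; it clamps them to each line's global range instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class SubSequenceSketch {
    // Assumption: start is inclusive, stop is exclusive, both counted over the
    // sequence characters only (no '>' header lines, no newlines).
    static String extract(BufferedReader br, int start, int stop) throws IOException {
        StringBuilder result = new StringBuilder();
        int lineStart = 0; // global position of the first character of the current line
        String line;
        while (lineStart < stop && (line = br.readLine()) != null) {
            if (line.startsWith(">")) {
                continue; // header lines do not count towards positions
            }
            int lineEnd = lineStart + line.length(); // exclusive
            // Overlap between [start, stop) and [lineStart, lineEnd).
            int from = Math.max(start, lineStart);
            int to = Math.min(stop, lineEnd);
            if (from < to) {
                // Convert global positions to indices local to this line.
                result.append(line, from - lineStart, to - lineStart);
            }
            lineStart = lineEnd;
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory input standing in for a real file.
        String fasta = ">header\nACGTACGTAC\nGGGTTTAAAC\nCCCA\n";
        try (BufferedReader br = new BufferedReader(new StringReader(fasta))) {
            System.out.println(extract(br, 8, 13));
        }
    }
}
```

Only the current line and the result are held in memory, and reading stops once the line containing `stop` has been processed. For sequences longer than `Integer.MAX_VALUE` characters the counters would need to be `long`.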
My other method is to save all lines up to and including the line that contains my stop position, and then extract a substring. This is not preferred, as my desired subset might be at the end of the file.
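For reference, that second method could look roughly like the sketch below. It assumes the same conventions as before (0-based `start` inclusive, `stop` exclusive, header lines skipped), and the names `AccumulateSketch` and `extractByAccumulating` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class AccumulateSketch {
    // Joins sequence lines until at least `stop` characters are buffered,
    // then cuts the requested range out of the joined text.
    static String extractByAccumulating(BufferedReader br, int start, int stop) throws IOException {
        StringBuilder all = new StringBuilder();
        String line;
        while (all.length() < stop && (line = br.readLine()) != null) {
            if (!line.startsWith(">")) {
                all.append(line);
            }
        }
        // Clamp so a range that runs past the end of the file does not throw.
        int end = Math.min(stop, all.length());
        int begin = Math.min(start, end);
        return all.substring(begin, end);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory input standing in for a real file.
        String fasta = ">header\nACGTACGTAC\nGGGTTTAAAC\nCCCA\n";
        try (BufferedReader br = new BufferedReader(new StringReader(fasta))) {
            System.out.println(extractByAccumulating(br, 8, 13));
        }
    }
}
```

The buffer grows to roughly `stop` characters, which is why this becomes expensive when the range sits near the end of a large file.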
Conditions:
- Memory friendly
- No BioJava if possible. I’m still learning to program, so I want to do this without it, even if that is the hard way
I'm not looking for fixed code; an article or example that gets me on the right track is perfectly fine. I've been looking at my screen for hours now without making progress, and my mind has gone a bit blank. As I said, the problem might be flawed thinking on my part, or I might be unaware of much better methods or techniques for this job.