
I'm trying to read in a file that contains a DNA sequence. Within my program I want to read each subsequence of that DNA of length 4 and store it in my HashMap to count the occurrences of each subsequence. For example, if I have the sequence CCACACCACACCCACACACCCAC and I want every subsequence of length 4, the first 3 subsequences would be:
CCAC, CACA, ACAC, etc.
To do this I have to iterate over the string several times. Here is my implementation:

try
    {
        String file = sc.nextLine();
        BufferedReader reader = new BufferedReader(new FileReader(file + ".fasta")); 

        Map<String, Integer> frequency = new HashMap<>(); 

        String line = reader.readLine();

        while(line != null)
        {
            System.out.println("Processing Line: " + line);
            String [] kmer = line.split("");

            for(String nucleotide : kmer)
            {
                System.out.print(nucleotide);
                int sequence = nucleotide.length(); 
                for(int i = 0; i < sequence; i++)
                {
                    String subsequence = nucleotide.substring(i, i+5); 
                    if(frequency.containsKey(subsequence))
                    {
                        frequency.put(subsequence, frequency.get(subsequence) +1);
                    }
                    else
                    {
                        frequency.put(subsequence, 1);
                    }
                }
            }
            System.out.println();
            line = reader.readLine();
        }
        System.out.println(frequency);            
    }
    catch(StringIndexOutOfBoundsException e)
    {
        System.out.println();
    }

I have a problem when reaching the end of the string: processing stops because of the error. How would I go about getting around that?

azro
Shah Bari
  • What is exactly the error? Can you edit your question to add the exception verbatim? – Perdi Estaquel Dec 10 '18 at 00:35
  • StringIndexOutOfBoundsException, when I reach the end of the string, I get an error from where the substring method is used. – Shah Bari Dec 10 '18 at 00:36
  • Could you please tell us what the expected output is for the input `CCACACCACACCCACACACCCAC`? – Bohemian Dec 10 '18 at 00:37
  • Right, we need the full stack trace. Either remove the try, catch block or add an e.printStackTrace() to your catch block. – Perdi Estaquel Dec 10 '18 at 00:38
  • java.lang.StringIndexOutOfBoundsException: String index out of range: 61 at java.lang.String.substring(String.java:1963) – Shah Bari Dec 10 '18 at 00:49
  • The loop goes until the last char, and you use substring from that index to index+5, which does not exist. Change it to `for(int i = 0; i <= sequence-5; i++)` – azro Dec 10 '18 at 00:50
  • @ShahBari Provide further detail as edits to your Question, rather than posting as Comments. – Basil Bourque Dec 10 '18 at 04:11

4 Answers


Based on the title of your post, try changing the condition of your while loop. Instead of the current:

String line = reader.readLine();
while(line != null) {
    // ...... your code .....
}

use this code:

String line;
while((line = reader.readLine()) != null) {
    // If file line is blank then skip to next file line.
    if (line.trim().equals("")) {
        continue;
    }
    // ...... your code .....
}

That would cover handling blank file lines.
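As a quick way to try this loop without a file on disk, here is a minimal sketch that feeds it from a StringReader. The class and method names (NonBlankLines, readNonBlank) are made up for illustration, and the in-memory string is only an assumed stand-in for the .fasta file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class NonBlankLines {
    // Hypothetical helper: returns every non-blank line of the given text.
    static List<String> readNonBlank(String text) {
        List<String> lines = new ArrayList<>();
        // A StringReader stands in for the FileReader used in the question.
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines instead of stopping at them
                }
                lines.add(line);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return lines;
    }

    public static void main(String[] args) {
        // prints [CCAC, ACAC]
        System.out.println(readNonBlank("CCAC\n\nACAC\n"));
    }
}
```

Note that the loop keeps reading after a blank line, so an empty line in the middle of the file does not cut processing short.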

Now, about the StringIndexOutOfBoundsException you are experiencing. I believe you already know why you are receiving this exception, so you need to decide what you want to do about it. When a string is split into chunks of a specific length, and the length of a file line is not evenly divisible by that chunk length, there are a few obvious options:

  • Ignore the remaining characters at the end of the file line. Although this is an easy solution, it's not very feasible since it would produce incomplete data. I don't know anything about DNA, but I'm certain this would not be the route to take.
  • Add the remaining DNA sequence (even though it's short) to the Map. Again, I know nothing about DNA, so I can't say whether this would be a viable solution.
  • Add the remaining short DNA sequence to the beginning of the next incoming file line and carry on breaking that line into 4-character chunks. Continue doing this until the end of the file is reached, at which point, if the final DNA sequence turns out to be short, add that to the Map (or not).

There may of course be other options; whichever you choose is something you will need to decide. To assist you, however, here is code to cover the three options I've mentioned:

Ignore the remaining characters:

Map<String, Integer> frequency = new HashMap<>();
String subsequence;
String line;
try (BufferedReader reader = new BufferedReader(new FileReader("DNA.txt"))) {
    while ((line = reader.readLine()) != null) {
        // If file line is blank then skip to next file line.
        if (line.trim().equals("")) {
            continue;
        }

        for (int i = 0; i < line.length(); i += 4) {
            // Get out of loop - Don't want to deal with remaining Chars.
            // A full 4-Char chunk ending exactly at the end of the line is kept.
            if ((i + 4) > line.length()) {
                break;
            }

            subsequence = line.substring(i, i + 4);
            if (frequency.containsKey(subsequence)) {
                frequency.put(subsequence, frequency.get(subsequence) + 1);
            }
            else {
                frequency.put(subsequence, 1);
            }
        }
    }
}
catch (IOException ex) {
    ex.printStackTrace();
}

Add the remaining DNA sequence (even though it's short) to the Map:

Map<String, Integer> frequency = new HashMap<>();
String subsequence;
String line;
try (BufferedReader reader = new BufferedReader(new FileReader("DNA.txt"))) {
    while ((line = reader.readLine()) != null) {
        // If file line is blank then skip to next file line.
        if (line.trim().equals("")) {
            continue;
        }

        String lineRemaining = "";

        for (int i = 0; i < line.length(); i += 4) {
            // Fewer than 4 Chars left - keep them and get out of loop
            if ((i + 4) > line.length()) {
                lineRemaining = line.substring(i);
                break;
            }

            subsequence = line.substring(i, i + 4);
            if (frequency.containsKey(subsequence)) {
                frequency.put(subsequence, frequency.get(subsequence) + 1);
            }
            else {
                frequency.put(subsequence, 1);
            }
        }
        if (lineRemaining.length() > 0) {
            subsequence = lineRemaining;
            if (frequency.containsKey(subsequence)) {
                frequency.put(subsequence, frequency.get(subsequence) + 1);
            }
            else {
                frequency.put(subsequence, 1);
            }
        }
    }
}
catch (IOException ex) {
    ex.printStackTrace();
}

Add the remaining short DNA sequence to the beginning of the next incoming file line:

Map<String, Integer> frequency = new HashMap<>();
String lineRemaining = "";
String subsequence;
String line;
try (BufferedReader reader = new BufferedReader(new FileReader("DNA.txt"))) {
    while ((line = reader.readLine()) != null) {
        // If file line is blank then skip to next file line.
        if (line.trim().equals("")) {
            continue;
        }
        // Add remaining portion of last line to new line.
        if (lineRemaining.length() > 0) {
            line = lineRemaining + line;
            lineRemaining = "";
        }

        for (int i = 0; i < line.length(); i += 4) {
            // Fewer than 4 Chars left - carry them over to the next line
            if ((i + 4) > line.length()) {
                lineRemaining = line.substring(i);
                break;
            }

            subsequence = line.substring(i, i + 4);
            if (frequency.containsKey(subsequence)) {
                frequency.put(subsequence, frequency.get(subsequence) + 1);
            }
            else {
                frequency.put(subsequence, 1);
            }
        }
    }
    // If any Chars remaining at end of file then
    // add to MAP (merge keeps any count already stored for that key)
    if (lineRemaining.length() > 0) {
        frequency.merge(lineRemaining, 1, Integer::sum);
    }
}
catch (IOException ex) {
    ex.printStackTrace();
}
DevilsHnd - 退職した

You are calling substring(i, i + 5). Near the end of the string, i + 5 goes out of bounds. Say your string is "ABCDEFGH", length 8: your loop goes from i = 0 to i = 7. When i reaches 4, substring(4, 9) cannot be computed and the exception is raised.

Try this:

for(int i = 0; i < sequence - 4; i++)
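
The bound above matches the substring(i, i + 5) call in the question. Since the stated goal is subsequences of length 4, here is a minimal sketch of the same idea with the window length as a parameter k. The class and method names (KmerCounter, countKmers) are made up for illustration, and Map.merge is used as a compact stand-in for the containsKey/put pair.

```java
import java.util.HashMap;
import java.util.Map;

public class KmerCounter {
    // Hypothetical helper: counts every overlapping substring of length k.
    static Map<String, Integer> countKmers(String sequence, int k) {
        Map<String, Integer> frequency = new HashMap<>();
        // The last valid start index is sequence.length() - k, so i + k
        // never exceeds the string length and substring cannot throw.
        for (int i = 0; i + k <= sequence.length(); i++) {
            // Insert 1 for a new key, otherwise add 1 to the stored count.
            frequency.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return frequency;
    }

    public static void main(String[] args) {
        // The sample sequence from the question, with k = 4.
        System.out.println(countKmers("CCACACCACACCCACACACCCAC", 4));
    }
}
```

A string shorter than k simply produces an empty map, so no try/catch is needed around the substring call.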
Perdi Estaquel
You can read each line directly and extract 4-character substrings from it, without needing to split the line up each time you read one.

You are getting the error because, as the program loops through the characters, fewer than 4 characters may be left at the end of the line to extract. For example, suppose you have the line CCACACC: grouping it into 4 characters gives a complete first group, CCAC, and an incomplete second group, ACC. So when the line nucleotide.substring(i, i+5); is reached, there may be no complete group of 4 characters left to extract, and the program throws the error. Also, to extract 4 characters you need to add 4, not 5.
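
On the splitting step itself, a small sketch of what line.split("") returns may help. This assumes Java 8 or later (older versions also return a leading empty string), and the class and method names (SplitDemo, pieces) are made up for illustration.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: returns the pieces produced by split("").
    static String[] pieces(String line) {
        return line.split("");
    }

    public static void main(String[] args) {
        String[] kmer = pieces("CCAC");
        // On Java 8+ every piece is a single character.
        System.out.println(Arrays.toString(kmer));
        // A piece of length 1 has no 4-character substring to extract.
        System.out.println(kmer[0].length());
    }
}
```

Because each piece is only one character long, calling substring on a piece with an end index of 4 or 5 cannot succeed, which is one more reason to work on the whole line instead.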

One workaround is to put the extraction line in a try block, as in the edited code below. Replace the loop with the following code.

while(line != null)
{
    for(int i = 0; i < line.length(); i++)
    {
        String subsequence;
        // put the extract operation in a try block
        // to avoid crashing
        try
        {
            subsequence = line.substring(i, i+4);
        }
        catch(StringIndexOutOfBoundsException e)
        {
            // fewer than 4 characters remain on this line,
            // so stop here instead of counting an empty string
            break;
        }

        if(frequency.containsKey(subsequence))
        {
            frequency.put(subsequence, frequency.get(subsequence) +1);
        }
        else
        {
            frequency.put(subsequence, 1);
        }
    }
    // read the next line with the BufferedReader from the question
    line = reader.readLine();
}
Anjan

It is not clear at all from the question description, but I'll guess your input file ends with an empty line.

Try removing the last newline in your input file, or alternatively check against empty in your while loop:

while (line != null && !line.isEmpty())
Perdi Estaquel
  • The problem lies here: String subsequence = nucleotide.substring(i, i+5); If I reach the end of the string, then it creates a problem – Shah Bari Dec 10 '18 at 00:50