I have a FASTA file that I want to parse into an ArrayList, with each element holding one entire sequence. The sequences span multiple lines, and I don't want to include the identification line in the string that I store.
My current code puts each line into its own element of the ArrayList. How do I make it so that the elements are delimited by the > character instead?

The fasta files are of the form:

>identification of a sequence 1
line1
line3
>identification of a sequence 2
line4
>identification of a sequence 3
line5
line6
line7

My current code is:

public static void main(String args[]) {

        String fileName = "fastafile.fasta";
        List<String> list = new ArrayList<>();

        try (Stream<String> stream = Files.lines(Paths.get(fileName))) {

            //1. filter out the identification lines (those starting with ">")
            //2. convert all content to upper case
            //3. convert it into a List
            list = stream
                    .filter(line -> !line.startsWith(">"))
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());

        } catch (IOException e) {
            e.printStackTrace();
        }

        list.forEach(System.out::println);


    }

For the above example, we would want an output such that:

System.out.println(list.size()); // this would be 3

System.out.println(list.get(0)); //this would be line1line3

System.out.println(list.get(1)); //this would be line4

System.out.println(list.get(2)); //this would be line5line6line7
Rann Lifshitz
Sam

1 Answer

Files.lines makes this a little trickier, because it hands you one line at a time while each of your sequences spans several lines.

Assuming you can read the entire content of the file into a single String, the following works (verified using an online compiler):

import java.util.*;
import java.util.stream.*;


public class Test {
   public static void main(String args[]) {
     String content = ">identification of a sequence 1\n" +
        "line1\n" +
        "line3\n" +
        ">identification of a sequence 2\n" +
        "line4\n" +
        ">identification of a sequence 3\n" +
        "line5\n" +
        "line6\n" +
        "line7";
     List<String> list = new ArrayList<>();
     try {
        list = Arrays.stream(content.split(">.*"))
          .filter(e -> !e.isEmpty())
          .map(e -> e.replace("\n","").trim())
          .collect(Collectors.toList());
     } catch (Exception e) {
         e.printStackTrace();
     }

     list.forEach(System.out::println);

     System.out.println(list.size()); // this would be 3

     System.out.println(list.get(0)); // this would be line1line3

     System.out.println(list.get(1)); // this would be line4

     System.out.println(list.get(2)); // this would be line5line6line7

   }
}

And the output is:

line1line3
line4
line5line6line7
3
line1line3
line4
line5line6line7
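
To apply the same idea to a real file, the content first has to be read into a String. Below is a minimal sketch, assuming the file fits in memory and is UTF-8 encoded (neither is stated in the question). The class name `FastaSplit` and the helpers `parseSequences` and `readSequences` are made up for illustration. The pattern is anchored with `(?m)^` so that only a `>` at the start of a line counts as a header, and `replaceAll("\\R", "")` removes `\r\n` line breaks as well as `\n`:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FastaSplit {

    // Splits FASTA text on header lines (">...") and joins the lines of each record.
    // Records that have a header but no sequence lines are dropped.
    static List<String> parseSequences(String content) {
        return Arrays.stream(content.split("(?m)^>.*"))
                .map(record -> record.replaceAll("\\R", "").trim())
                .filter(sequence -> !sequence.isEmpty())
                .collect(Collectors.toList());
    }

    // Reads the whole file into memory; UTF-8 is an assumption about the encoding.
    static List<String> readSequences(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return parseSequences(content);
    }

    public static void main(String[] args) throws IOException {
        // Write the sample from the question to a temporary file so the sketch runs anywhere.
        Path file = Files.createTempFile("sample", ".fasta");
        String sample = String.join("\n",
                ">identification of a sequence 1", "line1", "line3",
                ">identification of a sequence 2", "line4",
                ">identification of a sequence 3", "line5", "line6", "line7");
        Files.write(file, sample.getBytes(StandardCharsets.UTF_8));

        List<String> list = readSequences(file);
        System.out.println(list.size());
        list.forEach(System.out::println);

        Files.delete(file);
    }
}
```

For the file in the question, the call would be `readSequences(Paths.get("fastafile.fasta"))`. For very large files, reading everything into one String may not be practical, and a streaming approach would be a better fit.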
Rann Lifshitz
I’d use `.replaceAll("\\R","")` to make the code independent of the particular form of line break sequences used in the file. Regarding `Files.lines`, this is rather a job for `Scanner`, e.g. see [this Q&A](https://stackoverflow.com/q/48216161/2711488). Or look at [this answer](https://stackoverflow.com/a/49446094/2711488) for an example of identifying lines starting with a particular token. The pattern would be much simpler here, though, as the delimiter ought to be consumed, e.g. `"^>"`. – Holger Apr 29 '19 at 07:46
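
The `Scanner` suggestion in this comment could look roughly like the following. This is only a sketch of one way to read it, not code taken from the linked answers. The class name `FastaScanner` and the method `parseSequences` are hypothetical, and the delimiter is assumed to be `"^>"` compiled with `Pattern.MULTILINE` so that `^` matches at the start of every line:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class FastaScanner {

    // A ">" at the start of a line separates records (assumed delimiter).
    private static final Pattern RECORD_START = Pattern.compile("^>", Pattern.MULTILINE);

    static List<String> parseSequences(Readable source) {
        List<String> sequences = new ArrayList<>();
        try (Scanner scanner = new Scanner(source).useDelimiter(RECORD_START)) {
            while (scanner.hasNext()) {
                // Each token is one record: the header text followed by the sequence lines.
                String record = scanner.next();
                // Separate the header line from the remainder of the record.
                String[] parts = record.split("\\R", 2);
                if (parts.length == 2) {
                    // Join the sequence lines by removing every line break.
                    String sequence = parts[1].replaceAll("\\R", "");
                    if (!sequence.isEmpty()) {
                        sequences.add(sequence);
                    }
                }
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        String sample = ">identification of a sequence 1\nline1\nline3\n"
                + ">identification of a sequence 2\nline4\n"
                + ">identification of a sequence 3\nline5\nline6\nline7";
        List<String> list = parseSequences(new StringReader(sample));
        System.out.println(list.size());
        list.forEach(System.out::println);
    }
}
```

To read from a file instead, a `Scanner` can be built directly with `new Scanner(Paths.get(fileName))`. Unlike the `split` approach, this does not require loading the entire file into a single String first.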