I have a FASTA file that I want to parse into an ArrayList, with each element holding one entire sequence. The sequences span multiple lines, and I don't want to include the identification line in the string that I store.
My current code puts each line into its own element of the ArrayList. How do I make it so that each element is delimited by the > character?
The FASTA files have the form:
>identification of a sequence 1
line1
line3
>identification of a sequence 2
line4
>identification of a sequence 3
line5
line6
line7
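import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static void main(String[] args) {
    String fileName = "fastafile.fasta";
    List<String> list = new ArrayList<>();
    try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
        // 1. drop the identification lines (those starting with ">")
        // 2. convert all content to upper case
        // 3. collect the remaining lines into a List (one element per line)
        list = stream
                .filter(line -> !line.startsWith(">"))
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    } catch (IOException e) {
        e.printStackTrace();
    }
    list.forEach(System.out::println);
}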
For the example above, the desired result is:
System.out.println(list.size()); // this would be 3
System.out.println(list.get(0)); // this would be line1line3
System.out.println(list.get(1)); // this would be line4
System.out.println(list.get(2)); // this would be line5line6line7
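One way to get this grouping is to walk the lines once and start a new entry each time a line beginning with > appears. The following is only a minimal sketch of that idea, not a full FASTA parser. The names FastaParser, parseSequences and parseFile are hypothetical. It assumes that any lines before the first > line can be ignored, and it leaves out the upper-casing step from the code above so that the result matches the expected output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class FastaParser {

    // Groups FASTA lines into one string per record, dropping the ">" header lines.
    static List<String> parseSequences(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null; // stays null until the first header is seen

        for (String line : lines) {
            if (line.startsWith(">")) {
                // A header closes the previous record (if any) and opens a new one.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // Assumption: sequence lines are concatenated with no separator.
                current.append(line.trim());
            }
        }
        // Store the last record, which has no header after it.
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    // Convenience wrapper that reads the whole file into memory first.
    static List<String> parseFile(String fileName) throws IOException {
        return parseSequences(Files.readAllLines(Paths.get(fileName)));
    }
}
```

Calling FastaParser.parseFile("fastafile.fasta") on the sample file would then give a list with one element per record. A plain loop is used here instead of a stream because each element depends on the lines that came before it, which a stateless filter and map pipeline cannot express directly.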