I stumbled upon a weird behavior of the read.alignment function from the seqinr R package when reading in a Clustal Omega output file. The function doesn't seem to be able to properly parse the input. Consider this as a test.clu file:
CLUSTAL O(1.2.4) multiple sequence alignment
averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------
:::****:**:******* : ******.**.*::**:****
When I read it in and then print out the sequence names from the nam attribute, I get:
library(seqinr)
s <- read.alignment(file = "test.clu", format = "clustal", forceToLower = FALSE)
> s$nam
[1] "averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff" "-\n"
This then results in a messed-up data frame:
> data.frame(ID = s$nam, seq = s$seq)
ID seq
1 averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA------------
2 -\n --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA------------
Did anyone else encounter this "feature"? I checked for invisible characters in the input file that might trigger this, but found nothing.
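For reference, a check for invisible characters could look like the sketch below. It uses only base R. The file is recreated in a temporary location from the example above; the run lengths of A's and dashes and the single space between identifier and sequence are assumptions (Clustal Omega normally pads with several spaces), and the variable names are made up for illustration.

```r
# Recreate the example alignment in a temporary file (stand-in for test.clu).
# Assumed: 100 A's, 14 trailing dashes, one separating space.
id  <- "averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff"
aln <- paste0("--", strrep("A", 100), strrep("-", 14))
clu <- tempfile(fileext = ".clu")
writeLines(c("CLUSTAL O(1.2.4) multiple sequence alignment", "", paste(id, aln)), clu)

# Read the raw bytes and keep anything that is not printable ASCII, LF or CR.
bytes <- as.integer(readBin(clu, what = "raw", n = file.info(clu)$size))
suspicious <- bytes[!(bytes %in% c(10L, 13L, 32:126))]

# An empty result means no tabs, non-breaking spaces, BOMs or other hidden bytes.
print(unique(suspicious))
```

Pointing `clu` at the real test.clu instead of the temporary file runs the same check on the actual input.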
It does work with sequence identifiers longer than 10 characters (despite what is written in the manual):
CLUSTAL O(1.2.4) multiple sequence alignment
averylongstringofcharacters --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------
:::****:**:******* : ******.**.*::**:****
produces:
> s$nam
[1] "averylongstringofcharacters"
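To compare the failing and the working case, it may help to measure the identifier length and the total line length of each alignment line. The sketch below is base R only; the two lines are rebuilt by hand (the run lengths and the single separating space are assumptions), and line_stats is a hypothetical helper, not part of seqinr.

```r
# Hypothetical helper: identifier length and full line length per alignment line.
line_stats <- function(lines) {
  ids <- sub("\\s.*$", "", lines)  # identifier = text before the first whitespace
  data.frame(id_chars = nchar(ids), line_chars = nchar(lines))
}

# Assumed run lengths: 100 A's and 14 trailing dashes.
aln <- paste0("--", strrep("A", 100), strrep("-", 14))
long_id  <- "averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff"
short_id <- "averylongstringofcharacters"

# Row 1 is the failing case, row 2 the working one.
stats <- line_stats(c(paste(long_id, aln), paste(short_id, aln)))
print(stats)
```

If line length turns out to be the main difference between the two files, shortening the identifier a few characters at a time would be one way to see where parsing starts to break.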
Is there something that I'm missing? Thank you.