0

I stumbled upon a weird behavior of the read.alignment function from the seqinr R package when reading in a Clustal Omega output file. The function doesn't seem to be able to properly parse the input. Consider this as a test.clu file:

CLUSTAL O(1.2.4) multiple sequence alignment


averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff      --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------
                                                                                                               :::****:**:******* : ******.**.*::**:****

when i read it in and then print out the sequence name from the nam attribute, I get:

s = read.alignment(file = "test.clu", format = "clustal", forceToLower = F)
> s$nam
[1] "averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff" "-\n"

which then results in a messed up data frame:

> data.frame(ID = s$nam, seq = s$seq)
                                                                              ID                                                                                                                seq
1 averylongstringofcharactersthatcanbethesequenceidentifierorjustsomemadeupstuff --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA------------
2                                                                            -\n --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA------------

Did anyone else encounter this "feature"? I checked for invisible characters in the input file that might trigger this, but nothing

It does work with sequence identifiers that are > 10 characters (despite of what is written in the manual):

CLUSTAL O(1.2.4) multiple sequence alignment


averylongstringofcharacters      --AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------
                                                                                                               :::****:**:******* : ******.**.*::**:****

produces:

> s$nam
[1] "averylongstringofcharacters"

Is there something that I'm missing? Thank you

Marius
  • 990
  • 1
  • 14
  • 34

1 Answers1

0

Not really having an elegant solution, but at least the problem lies in the C functions that are under the hood of seqinr. A hardcoded value.. surprise, surprise..

From the read_clustal_align function within the alignment.c file:

 SEXP read_clustal_align(SEXP ficname)
{

  SEXP list;
  SEXP listseq;
  SEXP listname;
  SEXP nombreseq;
  char *fname;
  FILE *in;
  char line[200], *p;
  int i, l = 0, curr_spec, first=TRUEL, curr_len, next_len, tot_spec, curr_max_len =0, carac, wid_name = 0;
  char **seq, **comments, **seqname = NULL;
....

more specifically char line[200] which pretty much expects any line (identifier + sequence) in the Clustal input file to have a maximum of 200 characters. Why only 200? I don't know. Maybe in 1984 when they got out the first version of the package it made sense, but Clustal is able to deal with sequences much longer than that.

As a workaround, one can extract those files, up the 200 limit, re-compile, and build it as a custom package following: https://www.r-bloggers.com/2021/07/using-r-packaging-a-c-library-in-15-minutes/

If you are interested in just the identifiers and sequences from Clustal, like myself, then you can use this parser:

read_clustal = function(file){
  
  clustal = read.csv2(file, sep = "\n", skip = 3, header = F)
  clustal = data.frame(entries = clustal[-nrow(clustal), ]) # remove consensus
  clustal = unlist(strsplit(clustal$entries, "\\s+"))
  
  return(data.frame(ID = clustal[seq(1, (length(clustal) - 1), by = 2)],
                    seq = clustal[seq(2, (length(clustal)), by = 2)]))
}

Hooray for hardcoded values!

Marius
  • 990
  • 1
  • 14
  • 34