I have a tab-separated dataset and want to convert the following sequences into a character matrix, with one character per cell:

CATGGGGAAAACTGA
CCTCTCGATCACCGA
CCTATAGATCACCGA
CCGATTGATCACCGA
CCTTGTGCAGACCGA

Until now I have used:

rbind(strsplit("CATGGGGAAAACTGA","")[[1]],
      strsplit("CCTCTCGATCACCGA","")[[1]],
      strsplit("CCTATAGATCACCGA","")[[1]],
      strsplit("CCGATTGATCACCGA","")[[1]],
      strsplit("CCTTGTGCAGACCGA","")[[1]])

And this produces:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "C"  "A"  "T"  "G"  "G"  "G"  "G"  "A"  "A"  "A"   "A"   "C"   "T"   "G"   "A"  
[2,] "C"  "C"  "T"  "C"  "T"  "C"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[3,] "C"  "C"  "T"  "A"  "T"  "A"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[4,] "C"  "C"  "G"  "A"  "T"  "T"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[5,] "C"  "C"  "T"  "T"  "G"  "T"  "G"  "C"  "A"  "G"   "A"   "C"   "C"   "G"   "A"

But when the dataset is very large, typing out every sequence like this is impractical. How can I do it automatically?

zx8754
user_012314112
  • Use `do.call`: something like `do.call("rbind", lapply(myDNAVec, strsplit, split=""))`. – lmo Nov 18 '16 at 13:17
  • Are the sequence lengths fixed, always 15? – zx8754 Nov 18 '16 at 13:27
  • @lmo No need for `lapply`. `strsplit(myDNAvec, split = '')` will work. – Konrad Rudolph Nov 18 '16 at 13:29
  • _Possibly_ relevant Q&A: [Faster way to read fixed-width files in R](http://stackoverflow.com/questions/24715894/faster-way-to-read-fixed-width-files-in-r) – Henrik Nov 18 '16 at 13:33
  • @KonradRudolph Thanks. `lapply` creates a needless nest and probably additional overhead. `do.call(rbind, strsplit(myDNAvec, split = ''))` is better. – lmo Nov 18 '16 at 13:33
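
The approach from the comments above can be sketched end to end as follows. The name `myDNAvec` is taken from the comments and the sequences from the question; the commented-out `readLines()` call and its file name are hypothetical stand-ins for however the data is actually loaded. The sketch assumes every sequence has the same length, since `rbind` would otherwise recycle the shorter rows.

```r
# Sequences from the question. With a real file holding one sequence
# per line, this could instead be something like:
# myDNAvec <- readLines("sequences.txt")  # hypothetical file name
myDNAvec <- c("CATGGGGAAAACTGA",
              "CCTCTCGATCACCGA",
              "CCTATAGATCACCGA",
              "CCGATTGATCACCGA",
              "CCTTGTGCAGACCGA")

# Guard for the equal-length assumption stated above.
stopifnot(length(unique(nchar(myDNAvec))) == 1)

# strsplit() is vectorised: it returns a list with one character
# vector (the single letters) per input string.
chars <- strsplit(myDNAvec, split = "")

# Bind the list elements together as the rows of a character matrix.
m <- do.call(rbind, chars)
dim(m)
```

Because `strsplit` already handles the whole vector at once, no explicit loop or `lapply` is needed, which is the point made in the later comments.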

1 Answer

You could use `read.fwf` with a width of 1 for every column, so that each character is read as its own field:

read.fwf(textConnection("CATGGGGAAAACTGA
CCTCTCGATCACCGA
CCTATAGATCACCGA
CCGATTGATCACCGA
CCTTGTGCAGACCGA"), rep(1, nchar("CATGGGGAAAACTGA")))
#  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
#1  C  A  T  G  G  G  G  A  A   A   A   C   T   G   A
#2  C  C  T  C  T  C  G  A  T   C   A   C   C   G   A
#3  C  C  T  A  T  A  G  A  T   C   A   C   C   G   A
#4  C  C  G  A  T  T  G  A  T   C   A   C   C   G   A
#5  C  C  T  T  G  T  G  C  A   G   A   C   C   G   A

You might want to pass a file name instead of a text connection.
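
As a rough sketch of the file-based variant: the temporary file below is only a stand-in for a real data file, and the names `seqs`, `path`, `width`, `df` and `m` are illustrative. Two details are additions to the answer and not part of it: `colClasses = "character"` is passed through to `read.table` so that a column consisting only of `T` is not converted to logical `TRUE`, and `as.matrix()` turns the resulting data frame into the matrix the question asks for. It assumes all lines have the same length as the first one.

```r
# Write the question's sequences to a temporary file; in practice
# 'path' would simply point at the existing data file.
seqs <- c("CATGGGGAAAACTGA",
          "CCTCTCGATCACCGA",
          "CCTATAGATCACCGA",
          "CCGATTGATCACCGA",
          "CCTTGTGCAGACCGA")
path <- tempfile(fileext = ".txt")
writeLines(seqs, path)

# Take the number of columns from the first line
# (assumption: every line has the same length).
width <- nchar(readLines(path, n = 1))

# One field of width 1 per character; keep every column as character.
df <- read.fwf(path, widths = rep(1, width), colClasses = "character")

# Convert the data frame to a character matrix.
m <- as.matrix(df)
dim(m)

unlink(path)  # remove the temporary file
```

Compared with the `strsplit` route, this reads straight from disk, at the cost of going through a data frame first.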

Roland