1

let's for example I have 3 sequences

myseq=DNAStringSet(c("ATGACGAACTGTAAAGGACTGCACGGCC",
                     "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
                     "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT"))

and the patterns I need to search are

fragments= DNAStringSet(c("ACTG","AAAA"))

counts=vcountPDict(fragments,myseq)

I want to compile the information in the form of a table with headings DNA seq, fragment and count for fragment in different columns so it can be presented well.

lmo
  • 37,904
  • 9
  • 56
  • 69
  • Probably use `sapply` or `lapply` to loop through the search terms. I'm unfamiliar with how `vcountPDict` works. If it is not vecotrized, then you may need to use two of these apply functions. – lmo Apr 27 '17 at 12:12
  • its not working..... – user7931018 Apr 27 '17 at 12:59
  • What is your end goal? Are you trying to create a vector of counts for each time a DNA series appears? Are you trying to plot this? – akash87 Apr 27 '17 at 13:25

2 Answers2

0

I just asked a question like this here How to take a word and create an indicator variable based on the word's presence in comments?

fragment <- c("ACTG","AAAA") 
sequence <- c("ATGACGAACTGTAAAGGACTGCACGGCC",
              "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
              "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT")

sequence <- data.frame(sequence, sapply(fragment, function(i) as.numeric(grepl(i, sequence))))     
sequence
                                      sequence ACTG AAAA
 1                ATGACGAACTGTAAAGGACTGCACGGCC    1    0
 2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    1    1
 3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    1    0

From here you can go and do

seq.counts <- colSums(sequence[,2:3])
seq.means  <- colMeans(sequence[,2:3])

Or you can do

sequence <- data.frame(sequence, sapply(fragment, function(i) str_count(sequence, i)))

Which results in

> sequence
                                     sequence ACTG AAAA
1                ATGACGAACTGTAAAGGACTGCACGGCC    2    0
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    1    2
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    1    0

and you can use colSums or colMeans.

Community
  • 1
  • 1
akash87
  • 3,876
  • 3
  • 14
  • 30
  • with due respect, sir i tried the option first option given by you, but as you see the frequency count is not accurate.... – user7931018 May 01 '17 at 13:50
0
library(Biostrings)
myseq = DNAStringSet(c(
    "ATGACGAACTGTAAAGGACTGCACGGCC",
    "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
    "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT"
))

fragments = DNAStringSet(c("ACTG","AAAA"))
counts = vcountPDict(fragments, myseq)

# then use command of expand.grid and name two of the columns
a = expand.grid(seqID = myseq, pattern = fragments)

# Another variable to account for the frequency of each pattern 
b = expand.grid(freq = counts)
r = cbind.data.frame(a, b)

The result is

> r
                                        seqID pattern freq
1                ATGACGAACTGTAAAGGACTGCACGGCC    ACTG    2
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    ACTG    0
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    ACTG    1
4                ATGACGAACTGTAAAGGACTGCACGGCC    AAAA    2
5    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    AAAA    1
6 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    AAAA    0
James Hawley
  • 153
  • 1
  • 8